Submitted by:
| # | Name | Id | Email |
|---|---|---|---|
| Student 1 | [your name here] | [your id here] | [your email here] |
| Student 2 | [your name here] | [your id here] | [your email here] |
In this assignment we'll explore deep reinforcement learning. We'll implement two popular and related methods for directly learning the policy of an agent for playing a simple video game. Then we'll focus our attention on image generation and implement two different generative models: A variational autoencoder and a generative adversarial network.
hw1, hw2, etc).
You can of course use any editor or IDE to work on these files.
In the tutorial we have seen value-based reinforcement learning, in which we learn to approximate the action-value function $q(s,a)$.
In this exercise we'll explore a different approach: directly learning the agent's policy distribution, $\pi(a|s)$, by using policy gradients, in order to safely land on the moon!
%load_ext autoreload
%autoreload 2
%matplotlib inline
import unittest
import os
import sys
import pathlib
import urllib
import shutil
import re
import numpy as np
import torch
import torch.nn as nn
import torch.optim as optim
import matplotlib.pyplot as plt
test = unittest.TestCase()
plt.rcParams.update({'font.size': 12})
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
# Prefer CPU, GPU won't help much in this assignment
device = 'cpu'
print('Using device:', device)
# Seed for deterministic tests
SEED = 42
Using device: cpu
Some technical notes before we begin:
xvfb-run command to create a virtual screen. For example,srun do
srun -c2 --gres=gpu:1 xvfb-run -a -s "-screen 0 1440x900x24" python main.py run-nb <filename>
srun -c2 xvfb-run -a -s "-screen 0 1440x900x24" python main.py prepare-submission ...
xvfb-run command inside the jupyter-lab.sh script, so you can use it as usual with srun.
and so on.
The gym library is not officially supported on Windows. It should still be possible to install and run the environment needed for this exercise, but we cannot provide you with technical support for this. If you have trouble installing locally, we suggest running on the course server.
When running a gym environment locally (i.e. not on the course server), an interactive window should appear, showing you the gameplay. There's currently a known issue when running this through Jupyter: the window may remain open and seem stuck after the episode completes. If this happens, it's OK: you can keep running the notebook and the rest of the cells won't be affected. The window will close properly when you shut down the kernel.
Recall from the tutorial that we define the policy of an agent as the conditional distribution, $$ \pi(a|s) = \Pr(a_t=a\vert s_t=s), $$ which defines how likely the agent is to take action $a$ at state $s$.
Furthermore we define the action-value function, $$ q_{\pi}(s,a) = \E{g_t(\tau)|s_t = s,a_t=a,\pi} $$ where $$ g_t(\tau) = r_{t+1}+\gamma r_{t+2} + \dots = \sum_{k=0}^{\infty} \gamma^k r_{t+1+k}, $$ is the total discounted reward of a specific trajectory $\tau$ from time $t$, and the expectation in $q$ is over all possible trajectories, $ \tau=\left\{ (s_0,a_0,r_1,s_1), \dots (s_T,a_T,r_{T+1},s_{T+1}) \right\}. $
In the tutorial we saw that we can learn a value function by starting with some random function and updating it iteratively using the Bellman optimality equation. Given some action-value function, we can immediately create a policy from it by simply selecting the action which maximizes the action-value at the current state, i.e. $$ \pi(a|s) = \begin{cases} 1, & a = \arg\max_{a'\in\cset{A}} q(s,a') \\ 0, & \text{else} \end{cases}. $$ This is called $q$-learning. This approach aims to obtain a policy indirectly through the action-value function. Yet, in most cases we don't actually care about knowing the value of particular states, since all we need is a good policy for our agent.
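To make the greedy policy above concrete, here is a minimal pure-Python sketch. The function name `greedy_policy` and the example action-values are hypothetical and are not part of the `hw4` package; ties are broken in favor of the first maximal action, which is just one possible convention.

```python
def greedy_policy(q_values):
    """Return pi(.|s) as a one-hot list with all mass on argmax_a q(s, a)."""
    best = max(range(len(q_values)), key=lambda a: q_values[a])
    return [1.0 if a == best else 0.0 for a in range(len(q_values))]


# Hypothetical action-values for a single state with four actions
q_s = [0.1, -0.4, 2.3, 0.7]
pi_s = greedy_policy(q_s)
print(pi_s)  # [0.0, 0.0, 1.0, 0.0]
```

Note that such a policy is deterministic, which is one reason the rest of this part works with a stochastic policy instead.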
Here we'll take a different approach and learn a policy distribution $\pi(a|s)$ directly - by using policy gradients.
We define a parametric policy, $\pi_\vec{\theta}(a|s)$, and maximize total discounted reward (or minimize the negative reward): $$ \mathcal{L}(\vec{\theta})=\E[\tau]{-g(\tau)|\pi_\vec{\theta}} = -\int g(\tau)p(\tau|\vec{\theta})d\tau, $$ where $p(\tau|\vec{\theta})$ is the probability of a specific trajectory $\tau$ under the policy defined by $\vec{\theta}$.
Since we want to find the parameters $\vec{\theta}$ which minimize $\mathcal{L}(\vec{\theta})$, we'll compute the gradient w.r.t. $\vec{\theta}$: $$ \grad\mathcal{L}(\vec{\theta}) = -\int g(\tau)\grad p(\tau|\vec{\theta})d\tau. $$
Unfortunately, if we try to write $p(\tau|\vec{\theta})$ explicitly, we find that computing its gradient with respect to $\vec{\theta}$ is quite intractable due to a huge product of terms depending on $\vec{\theta}$: $$ p(\tau|\vec{\theta})=p\left(\left\{ (s_t,a_t,r_{t+1},s_{t+1})\right\}_{t\geq0}\given\vec{\theta}\right) =p(s_0)\prod_{t\geq0} \pi_{\vec{\theta}}(a_t|s_t)p(s_{t+1}|s_t,a_t). $$
However, by using the fact that $\grad_{x}\log(f(x))=\frac{\grad_{x}f(x)}{f(x)}$, we can convert the product into a sum: $$ \begin{align} \grad\mathcal{L}(\vec{\theta}) &= -\int g(\tau)\grad p(\tau|\vec{\theta})d\tau = -\int g(\tau)\frac{\grad p(\tau|\vec{\theta})}{p(\tau|\vec{\theta})}p(\tau|\vec{\theta})d\tau \\ &= -\int g(\tau)\grad\log\left(p(\tau|\vec{\theta})\right)p(\tau|\vec{\theta})d\tau \\ &= -\int g(\tau)\grad\log\left( p(s_0)\prod_{t\geq0} \pi_{\vec{\theta}}(a_t|s_t)p(s_{t+1}|s_t,a_t) \right) p(\tau|\vec{\theta})d\tau \\ &= -\int g(\tau)\grad\left( \log p(s_0) + \sum_{t\geq0} \log \pi_{\vec{\theta}}(a_t|s_t) + \sum_{t\geq0}\log p(s_{t+1}|s_t,a_t) \right) p(\tau|\vec{\theta})d\tau \\ &= -\int g(\tau)\sum_{t\geq0} \grad\log \pi_{\vec{\theta}}(a_t|s_t) p(\tau|\vec{\theta})d\tau \\ &= \E[\tau]{-g(\tau)\sum_{t\geq0} \grad\log \pi_{\vec{\theta}}(a_t|s_t)}. \end{align} $$
This is the "vanilla" version of the policy gradient. We can interpret it as a weighted log-likelihood function. The log-policy is the log-likelihood term we wish to maximize and the total discounted reward acts as a weight: high-return positive trajectories will cause the probability of actions taken during them to increase, and negative-return trajectories will cause the probabilities of actions taken to decrease.
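As a sanity check on the derivation above, the sketch below compares the score-function form $\E{g\,\grad\log\pi}$ with a finite-difference gradient of the expected return. It assumes a toy setting that is not part of the assignment: a single-step "trajectory", a softmax policy over three actions with logits `theta`, and a made-up return `g[a]` for each action. Note that it checks the gradient of the expected return; the gradient of the loss $\mathcal{L}$ is its negative.

```python
import math


def softmax(theta):
    m = max(theta)
    exps = [math.exp(t - m) for t in theta]
    z = sum(exps)
    return [e / z for e in exps]


def expected_return(theta, g):
    """J(theta) = sum_a pi(a) * g(a)."""
    return sum(p * r for p, r in zip(softmax(theta), g))


def score_function_gradient(theta, g):
    """E_a[g(a) * grad log pi(a)], computed exactly by enumerating actions."""
    pi = softmax(theta)
    n = len(theta)
    grad = [0.0] * n
    for a in range(n):
        for k in range(n):
            # d log pi(a) / d theta_k for a softmax policy
            dlogpi = (1.0 if a == k else 0.0) - pi[k]
            grad[k] += pi[a] * g[a] * dlogpi
    return grad


def finite_difference_gradient(theta, g, h=1e-6):
    """Central-difference approximation of grad J(theta)."""
    grad = []
    for k in range(len(theta)):
        up = list(theta)
        up[k] += h
        down = list(theta)
        down[k] -= h
        grad.append((expected_return(up, g) - expected_return(down, g)) / (2 * h))
    return grad


theta = [0.2, -1.0, 0.5]  # hypothetical policy logits
g = [1.0, -2.0, 3.0]      # hypothetical return of each action
sf = score_function_gradient(theta, g)
fd = finite_difference_gradient(theta, g)
print(max(abs(a - b) for a, b in zip(sf, fd)))  # tiny: the two gradients agree
```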
In the following figures we see three trajectories: high-return positive-reward (green), low-return positive-reward (yellow) and negative-return (red) and the action probabilities along the trajectories after the update. Credit: Sergey Levine.
[Two figures omitted: the three sampled trajectories, and the action probabilities along them after the update.]
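The effect described above, where a positive return pushes the probability of the taken action up and a negative return pushes it down, can be reproduced with a single gradient step. This is only an illustrative sketch: the softmax-over-logits policy, the learning rate and the returns are assumptions chosen for the example.

```python
import math


def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]


def ascent_step(theta, action, g, lr=0.1):
    """One step of theta <- theta + lr * g * grad log pi(action)."""
    pi = softmax(theta)
    return [
        t + lr * g * ((1.0 if k == action else 0.0) - pi[k])
        for k, t in enumerate(theta)
    ]


theta = [0.0, 0.0, 0.0, 0.0]  # uniform policy over four actions
p_before = softmax(theta)[2]
p_after_good = softmax(ascent_step(theta, action=2, g=+5.0))[2]
p_after_bad = softmax(ascent_step(theta, action=2, g=-5.0))[2]
print(p_after_bad < p_before < p_after_good)  # True
```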
The major drawback of the policy gradient is its high variance, which causes erratic optimization behavior and therefore slow convergence. One reason for this is that the log-policy weight term, $g(\tau)$, can vary wildly between different trajectories, even if they're similar in actions. Later on we'll implement the loss and explore some methods of variance reduction.
In the spirit of the recent achievements of the Israeli space industry, we'll apply our reinforcement learning skills to solve a simple game called LunarLander.
This game is available as an environment in OpenAI gym.
In this environment, you need to control the lander and get it to land safely on the moon. To do so, you must apply the bottom, right or left thrusters (each is either fully on or fully off) and get it to land within the designated zone as quickly as possible and with minimal wasted fuel.
import gym
# Just for fun :) ... but also to re-define the default max number of steps
ENV_NAME = 'Beresheet-v2'
MAX_EPISODE_STEPS = 300
if ENV_NAME not in gym.envs.registry.env_specs:
    gym.register(
        id=ENV_NAME,
        entry_point='gym.envs.box2d:LunarLander',
        max_episode_steps=MAX_EPISODE_STEPS,
        reward_threshold=200,
    )
import gym
env = gym.make(ENV_NAME)
print(env)
print(f'observations space: {env.observation_space}')
print(f'action space: {env.action_space}')
ENV_N_ACTIONS = env.action_space.n
ENV_N_OBSERVATIONS = env.observation_space.shape[0]
<TimeLimit<LunarLander<Beresheet-v2>>>
observations space: Box([-inf -inf -inf -inf -inf -inf -inf -inf], [inf inf inf inf inf inf inf inf], (8,), float32)
action space: Discrete(4)
The observation at each step consists of the lander's position, velocity, angle, angular velocity and ground contact state. The actions are no-op, fire left thruster, fire bottom thruster and fire right thruster.
You are highly encouraged to read the documentation in the source code of the LunarLander environment to understand the reward system,
and see how the actions and observations are created.
Let's start with our policy-model. This will be a simple neural net, which should take an observation and return a score for each possible action.
TODO:
- Implement the PolicyNet class in the hw4/rl_pg.py module. Start small. A simple MLP with a few hidden layers is a good starting point. You can come back and change it later based on the experiments.
- Implement the build_for_env method to instantiate a PolicyNet based on the configuration of a given environment.
- Set any hyperparameters your model needs in part1_pg_hyperparams() in hw4/answers.py.
print(env.observation_space.sample())
print(env.unwrapped.action_space)
[ 2.5857196 -0.9514567 0.05757594 -0.4913913 1.2619035 -0.69068193 0.80056274 1.9852805 ]
Discrete(4)
import hw4.rl_pg as hw4pg
import hw4.answers
hp = hw4.answers.part1_pg_hyperparams()
# You can add keyword-args to this function which will be populated from the
# hyperparameters dict.
p_net = hw4pg.PolicyNet.build_for_env(env, device, **hp)
p_net
PolicyNet(
(fc): Sequential(
(0): Linear(in_features=8, out_features=512, bias=True)
(1): ReLU()
(2): Linear(in_features=512, out_features=4, bias=True)
)
)
Now we need an agent. The purpose of our agent will be to act according to the current policy and generate experiences.
Our PolicyAgent will use a PolicyNet as the current policy function.
We'll also define some extra datatypes to help us represent the data generated by our agent.
You can find the Experience, Episode and TrainBatch datatypes in the hw4/rl_data.py module.
TODO: Implement the current_action_distribution() method of the PolicyAgent class in the hw4/rl_pg.py module.
for i in range(10):
    agent = hw4pg.PolicyAgent(env, p_net, device)
    d = agent.current_action_distribution()
    test.assertSequenceEqual(d.shape, (env.action_space.n,))
    test.assertAlmostEqual(d.sum(), 1.0, delta=1e-5)
print(d)
tensor([0.2590, 0.2205, 0.2669, 0.2537])
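A distribution like the one printed above can be obtained from raw action scores with a softmax, and an action can then be sampled from it. The following is a pure-Python sketch of that idea only; the helper names and the example scores are hypothetical, and the actual `PolicyAgent` would presumably do the same with tensors.

```python
import math
import random


def action_distribution(scores):
    """Numerically stable softmax over raw action scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def sample_action(probs, rng=random):
    """Draw one action index according to the given probabilities."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


scores = [0.30, 0.14, 0.33, 0.28]  # hypothetical network outputs for one state
probs = action_distribution(scores)
action = sample_action(probs, random.Random(0))  # seeded generator for repeatability
print(probs, action)
```

Sampling, rather than taking the argmax, is what lets the agent explore actions that the current policy considers less likely.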
TODO: Implement the step() method of the PolicyAgent.
agent = hw4pg.PolicyAgent(env, p_net, device)
exp = agent.step()
test.assertIsInstance(exp, hw4pg.Experience)
print(exp)
Experience(state=tensor([-0.0035, 1.4218, -0.3593, 0.4822, 0.0041, 0.0814, 0.0000, 0.0000]), action=3, reward=1.3705129569685834, is_done=False)
To test our agent, we'll write some code that allows it to play an environment. We'll use the Monitor
wrapper in gym to generate a video of the episode for visual debugging.
TODO: Complete the implementation of the monitor_episode() method of the PolicyAgent.
env, n_steps, reward = agent.monitor_episode(ENV_NAME, p_net, device=device)
To display the Monitor video in this notebook, we'll use a helper function from our jupyter_utils and a small wrapper that extracts the path of the last video file.
import cs236781.jupyter_utils as jupyter_utils
def show_monitor_video(monitor_env, idx=0, **kw):
    # Extract video path
    video_path = monitor_env.videos[idx][0]
    video_path = os.path.relpath(video_path, start=os.path.curdir)
    # Use helper function to embed the video
    return jupyter_utils.show_video_in_notebook(video_path, **kw)
print(f'Episode ran for {n_steps} steps. Total reward: {reward:.2f}')
show_monitor_video(env, idx=0)
Episode ran for 61 steps. Total reward: -85.92
The next step is to create data to train on. We need to train on batches of state-action pairs, so that our network can learn to predict the actions.
We'll split this task into three parts:
1. Generate batches of Episodes, by using an Agent that's playing according to our current policy network. Each Episode object contains the Experience objects created by the agent.
2. Calculate the q-values for the experiences in each Episode.
3. Convert batches of Episodes into a batch of tensors to train on. Each batch will contain the states, the action taken per state, the reward accrued, and the calculated estimated state-values. These will be stored in a TrainBatch object.

TODO: Complete the implementation of the episode_batch_generator() method in the TrainBatchDataset class within the hw4.rl_data module. This will address part 1 in the list above.
import hw4.rl_data as hw4data
def agent_fn():
    env = gym.make(ENV_NAME)
    hp = hw4.answers.part1_pg_hyperparams()
    p_net = hw4pg.PolicyNet.build_for_env(env, device, **hp)
    return hw4pg.PolicyAgent(env, p_net, device)
ds = hw4data.TrainBatchDataset(agent_fn, episode_batch_size=8, gamma=0.9)
batch_gen = ds.episode_batch_generator()
b = next(batch_gen)
print('First episode:', b[0])
test.assertEqual(len(b), 8)
for ep in b:
    test.assertIsInstance(ep, hw4data.Episode)
    # Check that it's a full episode
    is_done = [exp.is_done for exp in ep.experiences]
    test.assertFalse(any(is_done[0:-1]))
    test.assertTrue(is_done[-1])
First episode: Episode(total_reward=-174.26, #experences=65)
TODO: Complete the implementation of the calc_qvals() method in the Episode class.
This will address part 2.
These q-values are an estimate of the actual action value function: $$\hat{q}_{t} = \sum_{t'\geq t} \gamma^{t'-t}r_{t'+1}.$$
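The sum above does not need to be recomputed from scratch for every $t$: it satisfies $\hat{q}_t = r_{t+1} + \gamma\hat{q}_{t+1}$, so one backward pass over the rewards is enough. The sketch below illustrates this on a plain list of rewards, where `rewards[t]` plays the role of $r_{t+1}$. The function name is hypothetical, and this is not necessarily how `calc_qvals()` is meant to be written.

```python
def discounted_returns(rewards, gamma):
    """q_hat[t] = sum_{t' >= t} gamma^(t'-t) * rewards[t'], via a backward pass."""
    qvals = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running  # q_hat[t] = r + gamma * q_hat[t+1]
        qvals[t] = running
    return qvals


qvals_example = discounted_returns([1.0, 0.0, 2.0], gamma=0.5)
print(qvals_example)  # [1.5, 1.0, 2.0]
```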
np.random.seed(SEED)
test_rewards = np.random.randint(-10, 10, 100)
test_experiences = [hw4pg.Experience(None,None,r,False) for r in test_rewards]
test_episode = hw4data.Episode(np.sum(test_rewards), test_experiences)
qvals = test_episode.calc_qvals(0.9)
qvals = list(qvals)
expected_qvals = np.load(os.path.join('tests', 'assets', 'part1_expected_qvals.npy'))
for i in range(len(test_rewards)):
    test.assertAlmostEqual(expected_qvals[i], qvals[i], delta=1e-3)
TODO: Complete the implementation of the from_episodes() method in the TrainBatch class.
This will address part 3.
Notes:
- The TrainBatchDataset class provides a generator function that will use the above function to lazily generate batches of training samples and labels on demand.
- We can use a PyTorch dataloader to wrap our Dataset and provide us with parallel data loading for free! This means we can run multiple environments with multiple agents in separate background processes to generate data for training, and thus prevent the data loading bottleneck which is caused by the fact that we must generate full Episodes to train on in order to calculate the q-values.
- We set the DataLoader's batch_size to None because we have already implemented custom batching in our dataset.
- The number of worker processes is taken from the num_workers parameter in the hyperparams dict. Set num_workers=0 to disable parallelization.
from torch.utils.data import DataLoader
hp = hw4.answers.part1_pg_hyperparams()
ds = hw4data.TrainBatchDataset(agent_fn, episode_batch_size=8, gamma=0.9)
dl = DataLoader(
    ds,
    batch_size=None,
    num_workers=hp['num_workers'],
    multiprocessing_context='fork' if hp['num_workers'] > 0 else None
)
for i, train_batch in enumerate(dl):
    states, actions, qvals, reward_mean = train_batch
    print(f'#{i}: {train_batch}', end="\n\n")
    test.assertEqual(states.shape[0], actions.shape[0])
    test.assertEqual(qvals.shape[0], actions.shape[0])
    test.assertEqual(states.shape[1], env.observation_space.shape[0])
    if i > 1:
        break
#0: TrainBatch(states: torch.Size([754, 8]), actions: torch.Size([754]), q_vals: torch.Size([754])), num_episodes: 8)

#1: TrainBatch(states: torch.Size([768, 8]), actions: torch.Size([768]), q_vals: torch.Size([768])), num_episodes: 8)

#2: TrainBatch(states: torch.Size([732, 8]), actions: torch.Size([732]), q_vals: torch.Size([732])), num_episodes: 8)
As usual, we need a loss function to optimize over. We'll calculate three types of losses:
We have derived the policy-gradient as $$ \grad\mathcal{L}(\vec{\theta}) = \E[\tau]{-g(\tau)\sum_{t\geq0} \grad\log \pi_{\vec{\theta}}(a_t|s_t)}. $$
By writing the discounted reward explicitly and enforcing causality, i.e. the action taken at time $t$ can't affect the reward at time $t'<t$, we can get a slightly lower-variance version of the policy gradient:
$$ \grad\mathcal{L}_{\text{PG}}(\vec{\theta}) = \E[\tau]{-\sum_{t\geq0} \left(\sum_{t'\geq t} \gamma^{t'-t}r_{t'+1} \right)\grad\log \pi_{\vec{\theta}}(a_t|s_t)}. $$
In practice, the expectation over trajectories is estimated using a Monte-Carlo approach, i.e. by simply sampling $N$ trajectories and averaging the term inside the expectation. Therefore, we will use the following estimated version of the policy gradient:
$$ \begin{align} \hat\grad\mathcal{L}_{\text{PG}}(\vec{\theta}) &=-\frac{1}{N}\sum_{i=1}^{N}\sum_{t\geq0} \left(\sum_{t'\geq t} \gamma^{t'-t}r_{i,t'+1} \right)\grad\log \pi_{\vec{\theta}}(a_{i,t}|s_{i,t}) \\ &=-\frac{1}{N}\sum_{i=1}^{N}\sum_{t\geq0} \hat{q}_{i,t} \grad\log \pi_{\vec{\theta}}(a_{i,t}|s_{i,t}). \end{align} $$
Note the use of the notation $\hat{q}_{i,t}$ to represent the estimated action-value at time $t$ in the sampled trajectory $i$. Here $\hat{q}_{i,t}$ is acting as the weight-term for the policy gradient.
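In code, we don't compute this gradient by hand. Instead we compute a scalar loss whose gradient with respect to $\vec{\theta}$ is the estimate above, namely $-\frac{1}{N}\sum_{i}\sum_{t}\hat{q}_{i,t}\log\pi_{\vec{\theta}}(a_{i,t}|s_{i,t})$, and let automatic differentiation do the rest. The pure-Python sketch below evaluates that scalar for a flattened batch. The function names and inputs are hypothetical, and the normalization follows the formula literally (dividing by the number of episodes $N$), which may differ from what the assignment's tests expect.

```python
import math


def log_softmax(scores):
    """log pi(.|s) from raw action scores, computed stably."""
    m = max(scores)
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    return [s - log_z for s in scores]


def policy_gradient_loss(action_scores, actions, qvals, num_episodes):
    """-(1/N) * sum over all steps of q_hat * log pi(a|s)."""
    total = 0.0
    for scores, a, q in zip(action_scores, actions, qvals):
        total += q * log_softmax(scores)[a]
    return -total / num_episodes


# Two steps from one hypothetical episode, with uniform scores over 4 actions
loss_example = policy_gradient_loss(
    action_scores=[[0.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0]],
    actions=[1, 3],
    qvals=[2.0, -1.0],
    num_episodes=1,
)
print(loss_example)  # equals log(4), since -(2 - 1) * log(1/4) = log(4)
```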
TODO: Complete the implementation of the VanillaPolicyGradientLoss class in the hw4/rl_pg.py module.
# Ensure deterministic run
env = gym.make(ENV_NAME)
env.seed(SEED)
torch.manual_seed(SEED)
def agent_fn():
    # Use a simple "network" here, so that this test doesn't depend on
    # your specific PolicyNet implementation
    p_net_test = nn.Linear(ENV_N_OBSERVATIONS, ENV_N_ACTIONS, bias=True)
    agent = hw4pg.PolicyAgent(env, p_net_test)
    return agent
dataloader = hw4data.TrainBatchDataset(agent_fn, gamma=0.9, episode_batch_size=4)
test_batch = next(iter(dataloader))
test_action_scores = torch.randn(len(test_batch), env.action_space.n)
print(f"{test_batch=}", end='\n\n')
print(f"test_action_scores=\n{test_action_scores}\nshape={test_action_scores.shape}", end='\n\n')
loss_fn_p = hw4pg.VanillaPolicyGradientLoss()
loss_p, _ = loss_fn_p(test_batch, test_action_scores)
print(f'{loss_p=}')
test.assertAlmostEqual(loss_p.item(), -48.560, delta=1e-2)
test_batch=TrainBatch(states: torch.Size([375, 8]), actions: torch.Size([375]), q_vals: torch.Size([375])), num_episodes: 4)
test_action_scores=
tensor([[ 0.8932, 0.4749, 0.8569, -0.7365],
[-0.7853, 1.0901, -0.0665, 1.2573],
[ 0.0867, -1.2705, -0.1987, -0.4103],
...,
[-0.7778, -2.4352, 0.1117, 0.9482],
[-1.4593, -0.0609, -0.1148, 1.5804],
[ 1.2975, -0.3326, -1.0626, 0.3869]])
shape=torch.Size([375, 4])
loss_p=tensor(-48.5605, dtype=torch.float64)
Another way to reduce the variance of our gradient is to use relative weighting of the log-policy instead of absolute reward values. $$ \hat\grad\mathcal{L}_{\text{BPG}}(\vec{\theta}) =-\frac{1}{N}\sum_{i=1}^{N}\sum_{t\geq0} \left(\hat{q}_{i,t}-b\right) \grad\log \pi_{\vec{\theta}}(a_{i,t}|s_{i,t}). $$ In other words, we don't measure a trajectory's worth by its total reward, but by how much better that total reward is relative to some expected ("baseline") reward value, denoted above by $b$. Note that subtracting a baseline has no effect on the expected value of the policy gradient. It's easy to prove this directly by definition.
Here we'll implement a very simple baseline (not optimal in terms of variance reduction): the average of the estimated state-values $\hat{q}_{i,t}$.
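The claim that a baseline leaves the expected gradient unchanged can be sketched in one line, under the assumption that $b$ is a constant which depends neither on $\vec{\theta}$ nor on the sampled trajectory. Using the same log-derivative identity as in the derivation of the policy gradient:

$$
\E[\tau]{b\,\grad\log p(\tau|\vec{\theta})}
= b\int \grad\log\left(p(\tau|\vec{\theta})\right)p(\tau|\vec{\theta})d\tau
= b\int \grad p(\tau|\vec{\theta})d\tau
= b\,\grad\int p(\tau|\vec{\theta})d\tau
= b\,\grad 1 = 0.
$$

Since $\grad\log p(\tau|\vec{\theta})=\sum_{t\geq0}\grad\log\pi_{\vec{\theta}}(a_t|s_t)$, the extra term introduced by $b$ vanishes in expectation. A baseline computed from the same batch it is applied to only satisfies this assumption approximately.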
TODO: Complete the implementation of the BaselinePolicyGradientLoss class in the hw4/rl_pg.py module.
# Using the same batch and action_scores from above cell
loss_fn_p = hw4pg.BaselinePolicyGradientLoss()
loss_p, loss_dict = loss_fn_p(test_batch, test_action_scores)
print(f'{loss_dict=}')
test.assertAlmostEqual(loss_dict['baseline'], -29.841, delta=1e-2)
test.assertAlmostEqual(loss_p.item(), 1.297, delta=1e-2)
loss_dict={'loss_p': 1.2976918766803833, 'baseline': -29.841257246972788}
The entropy of a probability distribution (in our case the policy) is $$ H(\pi) = -\sum_{a} \pi(a|s)\log\pi(a|s). $$ The entropy is always non-negative and attains its maximum for a uniform distribution. We'll use the entropy of the policy as a bonus, i.e. we'll try to maximize it. The idea is to prevent the policy distribution from becoming too narrow and thus promote the agent's exploration.
First, we'll calculate the maximal possible entropy value of the action distribution for a set number of possible actions. This will be used as a normalization term.
TODO: Complete the implementation of the calc_max_entropy() method in the ActionEntropyLoss class.
loss_fn_e = hw4pg.ActionEntropyLoss(env.action_space.n)
print('max_entropy = ', loss_fn_e.max_entropy)
test.assertAlmostEqual(loss_fn_e.max_entropy, 1.38629436, delta=1e-3)
max_entropy = 1.3862943611198906
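The value printed above is $\log 4$: for a uniform distribution over $n$ actions, every term in the entropy sum is $\frac{1}{n}\log n$, so the maximum entropy is $\log n$. A small stand-alone sketch of this, with hypothetical helper names and example distributions:

```python
import math


def entropy(probs):
    """H(pi) = -sum_a pi(a) log pi(a), with 0 * log 0 taken as 0."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def max_entropy(n_actions):
    """Entropy of the uniform distribution over n_actions outcomes."""
    return math.log(n_actions)


uniform = [0.25] * 4
peaked = [0.97, 0.01, 0.01, 0.01]  # a narrow, low-exploration policy
print(entropy(uniform), max_entropy(4))  # both are log(4), about 1.3863
print(entropy(peaked) / max_entropy(4))  # normalized entropy, well below 1
```

Dividing by the maximum gives a value in $[0, 1]$, which makes the entropy term comparable across environments with different numbers of actions.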
TODO: Complete the implementation of the forward() method in the ActionEntropyLoss class.
loss_e, _ = loss_fn_e(test_batch, test_action_scores)
print('loss = ', loss_e)
test.assertAlmostEqual(loss_e.item(), -0.8103, delta=1e-2)
loss = tensor(-0.8106)
We'll implement our training procedure as follows:
This is known as the REINFORCE algorithm.
Fortunately, we've already implemented everything we need for steps 1-4 so we need only a bit more code to put it all together.
The following block implements a wrapper, train_pg to create all the objects we need in order to train our policy gradient model.
import hw4.answers
from functools import partial
ENV_NAME = "Beresheet-v2"
def agent_fn_train(agent_type, p_net, seed, envs_dict):
    winfo = torch.utils.data.get_worker_info()
    wid = winfo.id if winfo else 0
    seed = seed + wid if seed else wid
    env = gym.make(ENV_NAME)
    envs_dict[wid] = env
    env.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    return agent_type(env, p_net)
def train_rl(agent_type, net_type, loss_fns, hp, seed=None, checkpoints_file=None, **train_kw):
    print(f'hyperparams: {hp}')
    envs = {}
    p_net = net_type(ENV_N_OBSERVATIONS, ENV_N_ACTIONS, **hp)
    p_net.share_memory()
    agent_fn = partial(agent_fn_train, agent_type, p_net, seed, envs)
    dataset = hw4data.TrainBatchDataset(agent_fn, hp['batch_size'], hp['gamma'])
    dataloader = DataLoader(
        dataset, batch_size=None,
        num_workers=hp['num_workers'],
        multiprocessing_context='fork' if hp['num_workers'] > 0 else None
    )
    optimizer = optim.Adam(p_net.parameters(), lr=hp['learn_rate'], eps=hp['eps'])
    trainer = hw4pg.PolicyTrainer(p_net, optimizer, loss_fns, dataloader, checkpoints_file)
    try:
        trainer.train(**train_kw)
    except KeyboardInterrupt as e:
        print('Training interrupted by user.')
    finally:
        for env in envs.values():
            env.close()
    # Include final model state
    training_data = trainer.training_data
    training_data['model_state'] = p_net.state_dict()
    return training_data
def train_pg(baseline=False, entropy=False, **train_kwargs):
    hp = hw4.answers.part1_pg_hyperparams()
    loss_fns = []
    if baseline:
        loss_fns.append(hw4pg.BaselinePolicyGradientLoss())
    else:
        loss_fns.append(hw4pg.VanillaPolicyGradientLoss())
    if entropy:
        loss_fns.append(hw4pg.ActionEntropyLoss(ENV_N_ACTIONS, hp['beta']))
    return train_rl(hw4pg.PolicyAgent, hw4pg.PolicyNet, loss_fns, hp, **train_kwargs)
The PolicyTrainer class implements the training loop, collects the losses and rewards and provides some useful checkpointing functionality.
The training loop will generate batches of episodes and train on them until either:
- the mean reward over the last running_mean_len episodes is greater than the target_reward, OR
- the number of episodes reaches max_episodes.

Most of this class is already implemented for you.
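The stopping rule described here, a running mean of recent rewards compared against a target plus a cap on the number of episodes, can be sketched with a bounded `deque`. This is only an illustration under assumed names and made-up rewards; the real `PolicyTrainer` may differ, for example in how it treats a window that is not yet full.

```python
from collections import deque


def should_stop(recent_rewards, episodes_done, target_reward, max_episodes):
    """True once the running mean exceeds the target or the episode budget is spent."""
    if recent_rewards:
        running_mean = sum(recent_rewards) / len(recent_rewards)
    else:
        running_mean = float('-inf')
    return running_mean > target_reward or episodes_done >= max_episodes


window_len = 3                      # hypothetical stand-in for running_mean_len
recent = deque(maxlen=window_len)   # automatically drops the oldest reward
stopped_at = None
for episode, reward in enumerate([-200.0, -180.0, -150.0, -120.0, -100.0], start=1):
    recent.append(reward)
    if should_stop(recent, episode, target_reward=-140.0, max_episodes=2000):
        stopped_at = episode
        break
print(stopped_at)  # 5: the mean of the last three rewards first exceeds -140 here
```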
TODO:
- Implement the train_batch() method of the PolicyTrainer.
- Tweak the hyperparameters in the part1_pg_hyperparams() function within the hw4/answers.py module as needed. You get some sane defaults.

Let's check whether our model is actually training. We'll try to reach a very low (bad) target reward, just as a sanity check to see that training works. Your model should be able to reach this target reward within a few batches.
You can increase the target reward and use this block to manually tweak your model and hyperparameters a few times.
target_reward = -140 # VERY LOW target
#target_reward = 0
train_data = train_pg(target_reward=target_reward, seed=SEED, max_episodes=2000, running_mean_len=10)
test.assertGreater(train_data['mean_reward'][-1], target_reward)
hyperparams: {'batch_size': 32, 'gamma': 0.99, 'beta': 0.05, 'learn_rate': 0.0015, 'eps': 1e-07, 'num_workers': 0, 'hidden_dims': 512}
=== Training...
#2: step=00009036, loss_p=-103.80, m_reward(10)=-125.6 (best=-168.7): 5%| | 96
=== 🚀 SOLVED - Target reward reached! 🚀
We'll now run a few experiments to see the effect of different loss functions on the training dynamics. Namely, we'll try:
- vpg: No baseline, no entropy loss
- bpg: Baseline, no entropy loss
- epg: No baseline, with entropy loss
- cpg: Baseline, with entropy loss
from collections import namedtuple
from pprint import pprint
import itertools as it
ExpConfig = namedtuple('ExpConfig', ('name','baseline','entropy'))
def exp_configs():
    exp_names = ('vpg', 'epg', 'bpg', 'cpg')
    z = zip(exp_names, it.product((False, True), (False, True)))
    return (ExpConfig(n, b, e) for (n, (b, e)) in z)
pprint(list(exp_configs()))
[ExpConfig(name='vpg', baseline=False, entropy=False),
 ExpConfig(name='epg', baseline=False, entropy=True),
 ExpConfig(name='bpg', baseline=True, entropy=False),
 ExpConfig(name='cpg', baseline=True, entropy=True)]
We'll save the training data from each experiment for plotting.
import pickle
def dump_training_data(data, filename):
    os.makedirs(os.path.dirname(filename), exist_ok=True)
    with open(filename, mode='wb') as file:
        pickle.dump(data, file)
def load_training_data(filename):
    with open(filename, mode='rb') as file:
        return pickle.load(file)
Let's run the experiments! We'll run each configuration for a fixed number of episodes so that we can compare them.
Notes:
- To force a re-run of the experiments, set force_run to True.
import math
exp_max_episodes = 4000
results = {}
training_data_filename = os.path.join('results', f'part1_exp.dat')
# Set to True to force re-run (careful! will delete old experiment results)
force_run = False
# Skip running if results file exists.
if os.path.isfile(training_data_filename) and not force_run:
    print(f'=== results file {training_data_filename} exists, skipping experiments.')
    results = load_training_data(training_data_filename)
else:
    for n, b, e in exp_configs():
        print(f'=== Experiment {n}')
        results[n] = train_pg(baseline=b, entropy=e, max_episodes=exp_max_episodes, post_batch_fn=None)
    dump_training_data(results, training_data_filename)
=== results file results/part1_exp.dat exists, skipping experiments.
def plot_experiment_results(results, fig=None):
    if fig is None:
        fig, _ = plt.subplots(nrows=2, ncols=2, sharex=True, figsize=(18,12))
    for i, plot_type in enumerate(('loss_p', 'baseline', 'loss_e', 'mean_reward')):
        ax = fig.axes[i]
        for exp_name, exp_res in results.items():
            if plot_type not in exp_res:
                continue
            ax.plot(exp_res['episode_num'], exp_res[plot_type], label=exp_name)
        ax.set_title(plot_type)
        ax.set_xlabel('episode')
        ax.legend()
    return fig
experiments_results_fig = plot_experiment_results(results)
You should see positive training dynamics in the graphs (reward going up). If you don't, use them to further update your model or hyperparams.
To pass the test, you'll need to get a best total mean reward of at least 10 in the fixed number of episodes using the combined loss. It's possible to get much higher (over 100).
best_cpg_mean_reward = max(results['cpg']['mean_reward'])
print(f'Best CPG mean reward: {best_cpg_mean_reward:.2f}')
test.assertGreater(best_cpg_mean_reward, 10)
Best CPG mean reward: 91.58
Now let's take a look at a gameplay video of our cpg model after the short training!
hp = hw4.answers.part1_pg_hyperparams()
p_net_cpg = hw4pg.PolicyNet.build_for_env(env, **hp)
p_net_cpg.load_state_dict(results['cpg']['model_state'])
env, n_steps, reward = hw4pg.PolicyAgent.monitor_episode(ENV_NAME, p_net_cpg)
print(f'{n_steps} steps, total reward: {reward:.2f}')
show_monitor_video(env)
300 steps, total reward: 54.55
We have seen that the policy-gradient loss can be interpreted as a log-likelihood of the policy term (selecting a specific action at a specific state), weighted by the future rewards of that choice of action.
However, naïvely weighting by rewards has significant drawbacks in terms of the variance of the resulting gradient. We addressed this by adding a simple baseline term which represented our "expected reward" so that we increase probability of actions leading to trajectories which exceed this expectation and vice-versa.
In this part we'll explore a more powerful baseline, which is the idea behind the advantage actor-critic (AAC) method.
Recall the definition of the state-value function $v_{\pi}(s)$ and action-value function $q_{\pi}(s,a)$:
$$ \begin{align} v_{\pi}(s) &= \E{g(\tau)|s_0 = s,\pi} \\ q_{\pi}(s,a) &= \E{g(\tau)|s_0 = s,a_0=a,\pi}. \end{align} $$
Both these functions represent the value of the state $s$. However, $v_\pi$ averages over the first action according to the policy, while $q_\pi$ fixes the first action and then continues according to the policy.
Their difference is known as the advantage function: $$ a_\pi(s,a) = q_\pi(s,a)-v_\pi(s). $$
If $a_\pi(s,a)>0$ it means that it's better (in expectation) to take action $a$ in state $s$ compared to the average action. In other words, $a_\pi(s,a)$ represents the advantage of using action $a$ in state $s$ compared to the others.
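To illustrate, for a discrete action space the two definitions above imply $v_\pi(s)=\sum_a \pi(a|s)q_\pi(s,a)$, so the advantages at any state average to zero under the policy. The sketch below checks this with made-up numbers; the policy and action-values are hypothetical.

```python
def state_value(pi, q):
    """v(s) = sum_a pi(a|s) * q(s, a)."""
    return sum(p * qa for p, qa in zip(pi, q))


def advantages(pi, q):
    """a(s, a) = q(s, a) - v(s) for every action a."""
    v = state_value(pi, q)
    return [qa - v for qa in q]


pi_s = [0.1, 0.2, 0.3, 0.4]  # hypothetical policy at state s
q_s = [1.0, -1.0, 2.0, 0.5]  # hypothetical action-values at state s
adv = advantages(pi_s, q_s)
mean_adv = sum(p * a for p, a in zip(pi_s, adv))
print(state_value(pi_s, q_s))  # about 0.7
print(adv)                     # positive for better-than-average actions
print(mean_adv)                # about 0, up to floating-point rounding
```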
So far we have used an estimate for $q_\pi$ as our weighting term for the log-policy, with a fixed baseline per batch.
$$ \hat\grad\mathcal{L}_{\text{BPG}}(\vec{\theta}) =-\frac{1}{N}\sum_{i=1}^{N}\sum_{t\geq0} \left(\hat{q}_{i,t}-b\right) \grad\log \pi_{\vec{\theta}}(a_{i,t}|s_{i,t}). $$
Now, we will use the state value as a baseline, so that an estimate of the advantage function is our weighting term:
$$ \hat\grad\mathcal{L}_{\text{AAC}}(\vec{\theta}) =-\frac{1}{N}\sum_{i=1}^{N}\sum_{t\geq0} \left(\hat{q}_{i,t}-v_\pi(s_{i,t})\right) \grad\log \pi_{\vec{\theta}}(a_{i,t}|s_{i,t}). $$
Intuitively, using the advantage function makes sense because it means we're weighting our policy's actions according to how advantageous they are compared to other possible actions.
But how will we know $v_\pi(s)$? We'll learn it of course, using another neural network. This is known as actor-critic learning. We simultaneously learn the policy (actor) and the value of states (critic). We'll treat it as a regression task: given a state $s_{i,t}$, our state-value network will output $\hat{v}_\pi(s_{i,t})$, an estimate of the actual unknown state-value. Our regression targets will be the discounted rewards, $\hat{q}_{i,t}$ (see question 2), and we can use a simple MSE as the loss function, $$ \mathcal{L}_{\text{SV}} = \frac{1}{N}\sum_{i=1}^{N}\sum_{t\geq0}\left(\hat{v}_\pi(s_{i,t}) - \hat{q}_{i,t}\right)^2. $$
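The two terms can be combined into a single scalar to minimize. The sketch below evaluates an advantage-weighted policy term and the state-value MSE on a flattened batch, then adds them with a weight `delta`. The names, the way `delta` is used, and the normalization by the number of episodes are assumptions made for illustration, not the required `AACPolicyGradientLoss` implementation.

```python
import math


def log_softmax(scores):
    """log pi(.|s) from raw action scores, computed stably."""
    m = max(scores)
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    return [s - log_z for s in scores]


def aac_loss(action_scores, state_values, actions, qvals, num_episodes, delta=1.0):
    """Return (total, policy_term, value_term) for one flattened batch."""
    loss_p = 0.0
    loss_v = 0.0
    for scores, v, a, q in zip(action_scores, state_values, actions, qvals):
        advantage = q - v  # used purely as a weight on the log-policy
        loss_p -= advantage * log_softmax(scores)[a]
        loss_v += (v - q) ** 2
    loss_p /= num_episodes
    loss_v /= num_episodes
    return loss_p + delta * loss_v, loss_p, loss_v


# Two steps from one hypothetical episode, uniform scores over 4 actions
total, loss_p, loss_v = aac_loss(
    action_scores=[[0.0] * 4, [0.0] * 4],
    state_values=[1.0, 1.0],
    actions=[0, 2],
    qvals=[3.0, 0.0],
    num_episodes=1,
)
print(loss_p, loss_v, total)  # log(4), 5.0, and their sum
```

In an automatic-differentiation framework, the advantage would typically be detached from the graph when it weights the log-policy, so that the policy term does not push gradients into the value estimate.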
We'll build heavily on our implementation of the regular policy-gradient method, and just add a new model class and a new loss class, with a small modification to the agent.
Let's start with the model. It will accept a state, and return action scores (as before), but also the value of that state. You can experiment with a dual-head network that has a shared base, or implement two separate parts within the network.
TODO:
- Implement the AACPolicyNet class in the hw4/rl_ac.py module.
- Set the hyperparameters in the part1_aac_hyperparams() function of the hw4.answers module.
import hw4.rl_ac as hw4ac
hp = hw4.answers.part1_aac_hyperparams()
pv_net = hw4ac.AACPolicyNet.build_for_env(env, device, **hp)
pv_net
AACPolicyNet(
(fc): Sequential(
(0): Linear(in_features=8, out_features=512, bias=True)
(1): ReLU()
)
(policy): Sequential(
(0): Linear(in_features=512, out_features=4, bias=True)
)
(value): Sequential(
(0): Linear(in_features=512, out_features=1, bias=True)
)
)
TODO: Complete the implementation of the agent class, AACPolicyAgent, in the hw4/rl_ac.py module.
agent = hw4ac.AACPolicyAgent(env, pv_net, device)
exp = agent.step()
test.assertIsInstance(exp, hw4pg.Experience)
print(exp)
Experience(state=tensor([-0.0066, 1.3987, -0.6635, -0.5428, 0.0076, 0.1503, 0.0000, 0.0000]), action=0, reward=-1.0456538864927154, is_done=False)
TODO: Implement the AAC loss function as the class AACPolicyGradientLoss in the hw4/rl_ac.py module.
loss_fn_aac = hw4ac.AACPolicyGradientLoss(delta=1.)
test_state_values = torch.ones(test_action_scores.shape[0], 1)
loss_t, loss_dict = loss_fn_aac(test_batch, (test_action_scores, test_state_values))
print(f'{loss_dict=}')
test.assertAlmostEqual(loss_dict['adv_m'], -30.841, delta=1e-2)
test.assertAlmostEqual(loss_t.item(), 1466.830, delta=1e-2)
loss_dict={'loss_p': -50.23126021207819, 'loss_v': 1517.0619799854114, 'adv_m': -30.84125724697279}
Let's run the same experiment as before, but with the AAC method and compare the results.
def train_aac(baseline=False, entropy=False, **train_kwargs):
    hp = hw4.answers.part1_aac_hyperparams()
    loss_fns = [hw4ac.AACPolicyGradientLoss(hp['delta']), hw4pg.ActionEntropyLoss(ENV_N_ACTIONS, hp['beta'])]
    return train_rl(hw4ac.AACPolicyAgent, hw4ac.AACPolicyNet, loss_fns, hp, **train_kwargs)
training_data_filename = os.path.join('results', f'part1_exp_aac.dat')
# Set to True to force re-run (careful, will delete old experiment results)
force_run = False
if os.path.isfile(training_data_filename) and not force_run:
    print(f'=== results file {training_data_filename} exists, skipping experiments.')
    results_aac = load_training_data(training_data_filename)
else:
    print(f'=== Running AAC experiment')
    training_data = train_aac(max_episodes=exp_max_episodes)
    results_aac = dict(aac=training_data)
    dump_training_data(results_aac, training_data_filename)
=== results file results/part1_exp_aac.dat exists, skipping experiments.
experiments_results_fig = plot_experiment_results(results)
plot_experiment_results(results_aac, fig=experiments_results_fig);
You should get better results with the AAC method, so this time the bar is higher (again, you should aim for a mean reward of 100+). Compare the graphs with those of the combined PG method and check that they make sense.
best_aac_mean_reward = max(results_aac['aac']['mean_reward'])
print(f'Best AAC mean reward: {best_aac_mean_reward:.2f}')
test.assertGreater(best_aac_mean_reward, 50)
Best AAC mean reward: 86.91
Now, using your best model and hyperparams, let's train the model for much longer and see how it performs. Just for fun, we'll also visualize an episode every now and then so that we can see how well the agent is playing.
TODO:
When you are satisfied with the results, rename the checkpoint file by adding _final to the file name.
This will cause the block to skip training and instead load your saved model when running the homework submission script.
Note that your submission zip file will not include the checkpoint file. This is OK.
import IPython.display
CHECKPOINTS_FILE = f'checkpoints/{ENV_NAME}-ac.dat'
CHECKPOINTS_FILE_FINAL = f'checkpoints/{ENV_NAME}-ac_final.dat'
TARGET_REWARD = 125
MAX_EPISODES = 15_000
def post_batch_fn(batch_idx, p_net, batch, print_every=20, final=False):
    if not final and batch_idx % print_every != 0:
        return
    env, n_steps, reward = hw4ac.AACPolicyAgent.monitor_episode(ENV_NAME, p_net)
    html = show_monitor_video(env, width="500")
    IPython.display.clear_output(wait=True)
    print(f'Monitor@#{batch_idx}: n_steps={n_steps}, total_reward={reward:.3f}, final={final}')
    IPython.display.display_html(html)

if os.path.isfile(CHECKPOINTS_FILE_FINAL):
    print(f'=== {CHECKPOINTS_FILE_FINAL} exists, skipping training...')
    checkpoint_data = torch.load(CHECKPOINTS_FILE_FINAL)
    hp = hw4.answers.part1_aac_hyperparams()
    pv_net = hw4ac.AACPolicyNet.build_for_env(env, **hp)
    pv_net.load_state_dict(checkpoint_data['params'])
    print(f'=== Running best model...')
    env, n_steps, reward = hw4ac.AACPolicyAgent.monitor_episode(ENV_NAME, pv_net)
    print(f'=== Best model ran for {n_steps} steps. Total reward: {reward:.2f}')
    IPython.display.display_html(show_monitor_video(env))
    best_mean_reward = checkpoint_data["best_mean_reward"]
else:
    print(f'=== Starting training...')
    # Pass the target by keyword: the first positional parameter of train_aac() is
    # `baseline`, so a positional TARGET_REWARD would never reach the trainer.
    train_data = train_aac(target_reward=TARGET_REWARD, max_episodes=MAX_EPISODES,
                           seed=None, checkpoints_file=CHECKPOINTS_FILE, post_batch_fn=post_batch_fn)
    print(f'=== Done, ', end='')
    best_mean_reward = train_data["best_mean_reward"][-1]
    print(f'num_episodes={train_data["episode_num"][-1]}, best_mean_reward={best_mean_reward:.1f}')
test.assertGreaterEqual(best_mean_reward, TARGET_REWARD)
Monitor@#1120: n_steps=295, total_reward=289.778, final=False
#1140: step=03834558, loss_p= -7.20, loss_v= 15.31, adv_m=-13.63, loss_e= -0.00,
---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
(...)
Input In [34], in post_batch_fn(batch_idx, p_net, batch, print_every, final)
---> 11 env, n_steps, reward = hw4ac.AACPolicyAgent.monitor_episode(ENV_NAME, p_net)
(...)
File ~/miniconda3/envs/cs236781-hw4/lib/python3.8/site-packages/gym/envs/box2d/lunar_lander.py:391, in LunarLander.render(self, mode)
--> 391 self.viewer = rendering.Viewer(VIEWPORT_W, VIEWPORT_H)
File ~/miniconda3/envs/cs236781-hw4/lib/python3.8/site-packages/gym/envs/classic_control/rendering.py:88, in Viewer.__init__(self, width, height, display)
---> 88 self.window = get_window(width=width, height=height, display=display)
File ~/miniconda3/envs/cs236781-hw4/lib/python3.8/site-packages/gym/envs/classic_control/rendering.py:69, in get_window(width, height, display, **kwargs)
---> 69 config = screen[0].get_best_config() # selecting the first screen
IndexError: list index out of range
TODO: Answer the following questions. Write your answers in the appropriate variables in the module hw4/answers.py.
from cs236781.answers import display_answer
import hw4.answers
Explain qualitatively why subtracting a baseline in the policy-gradient helps reduce its variance. Specifically, give an example where it helps.
display_answer(hw4.answers.part1_q1)
In AAC, when using the estimated q-values as regression targets for our state-values, why do we get a valid approximation? Hint: how is $v_\pi(s)$ expressed in terms of $q_\pi(s,a)$?
display_answer(hw4.answers.part1_q2)
cpg).
display_answer(hw4.answers.part1_q3)
In this part we will learn to generate new data using a special type of autoencoder model which allows us to sample from its latent space. We'll implement and train a VAE and use it to generate new images.
import unittest
import os
import sys
import pathlib
import urllib
import shutil
import re
import zipfile
import numpy as np
import torch
import matplotlib.pyplot as plt
%load_ext autoreload
%autoreload 2
test = unittest.TestCase()
plt.rcParams.update({'font.size': 12})
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print('Using device:', device)
Using device: cuda
Let's begin by downloading a dataset of images that we want to learn to generate. We'll use the Labeled Faces in the Wild (LFW) dataset which contains many labeled faces of famous individuals.
We're going to train our generative model to generate a specific face, not just any face. Since the person with the most images in this dataset is former president George W. Bush, we'll set out to train a Bush Generator :)
However, if you feel adventurous and/or prefer to generate something else, feel free
to edit the PART2_CUSTOM_DATA_URL variable in hw4/answers.py.
import cs236781.plot as plot
import cs236781.download
from hw4.answers import PART2_CUSTOM_DATA_URL as CUSTOM_DATA_URL
DATA_DIR = pathlib.Path.home().joinpath('.pytorch-datasets')
if CUSTOM_DATA_URL is None:
    DATA_URL = 'http://vis-www.cs.umass.edu/lfw/lfw-bush.zip'
else:
    DATA_URL = CUSTOM_DATA_URL
_, dataset_dir = cs236781.download.download_data(out_path=DATA_DIR, url=DATA_URL, extract=True, force=False)
File /home/rudman/.pytorch-datasets/lfw-bush.zip exists, skipping download.
Extracting /home/rudman/.pytorch-datasets/lfw-bush.zip...
Extracted 531 to /home/rudman/.pytorch-datasets/lfw/George_W_Bush
Create a Dataset object that will load the extracted images:
import torchvision.transforms as T
from torchvision.datasets import ImageFolder
im_size = 64
tf = T.Compose([
    # Resize to constant spatial dimensions
    T.Resize((im_size, im_size)),
    # PIL.Image -> torch.Tensor
    T.ToTensor(),
    # Dynamic range [0,1] -> [-1, 1]
    T.Normalize(mean=(.5,.5,.5), std=(.5,.5,.5)),
])
ds_gwb = ImageFolder(os.path.dirname(dataset_dir), tf)
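The arithmetic behind the last transform is simple enough to check by hand. The sketch below reproduces the per-channel formula `(x - mean) / std` in plain Python and shows its inverse, which is handy when mapping generated images back to the [0, 1] range for display. The helper names `normalize` and `denormalize` are hypothetical.

```python
def normalize(x, mean=0.5, std=0.5):
    """Per-channel normalization: (x - mean) / std."""
    return (x - mean) / std


def denormalize(y, mean=0.5, std=0.5):
    """Inverse mapping, from [-1, 1] back to [0, 1]."""
    return y * std + mean


print(normalize(0.0), normalize(1.0))   # -1.0 1.0
print(denormalize(normalize(0.25)))     # 0.25
```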
OK, let's see what we got. You can run the following block multiple times to display a random subset of images from the dataset.
_ = plot.dataset_first_n(ds_gwb, 50, figsize=(15,10), nrows=5)
print(f'Found {len(ds_gwb)} images in dataset folder.')
Found 530 images in dataset folder.
x0, y0 = ds_gwb[0]
x0 = x0.unsqueeze(0).to(device)
print(x0.shape)
test.assertSequenceEqual(x0.shape, (1, 3, im_size, im_size))
torch.Size([1, 3, 64, 64])
An autoencoder is a model which learns a representation of data in an unsupervised fashion (i.e. without any labels). Recall its general form from the lecture:

An autoencoder maps an instance $\bb{x}$ to a latent-space representation $\bb{z}$. It has an encoder part, $\Phi_{\bb{\alpha}}(\bb{x})$ (a model with parameters $\bb{\alpha}$) and a decoder part, $\Psi_{\bb{\beta}}(\bb{z})$ (a model with parameters $\bb{\beta}$).
While autoencoders can learn useful representations, generally it's hard to use them as generative models because there's no distribution we can sample from in the latent space. In other words, we have no way to choose a point $\bb{z}$ in the latent space such that $\Psi(\bb{z})$ will end up on the data manifold in the instance space.

The variational autoencoder (VAE), first proposed by Kingma and Welling, addresses this issue by taking a probabilistic perspective. Briefly, a VAE model can be described as follows.
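We define, in Bayesian terminology,
The prior distribution $p(\bb{Z})$ on points in the latent space.
The likelihood distribution of a sample $\bb{X}$ given a latent-space representation: $p(\bb{X}|\bb{Z})$.
The posterior distribution of points in the latent space given a specific instance: $p(\bb{Z}|\bb{X})$.
The evidence distribution $p(\bb{X})$, i.e. the distribution of the instance space due to the generative process.
To create our variational decoder we'll further specify:
A parametric likelihood distribution, $p _{\bb{\beta}}(\bb{X} | \bb{Z}=\bb{z}) = \mathcal{N}( \Psi _{\bb{\beta}}(\bb{z}) , \sigma^2 \bb{I} )$, where $\sigma^2$ is a hyperparameter and $\bb{\beta}$ are the parameters of the decoder network.
A fixed latent-space prior distribution, $p(\bb{Z}) = \mathcal{N}(\bb{0},\bb{I})$.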
This setting allows us to generate a new instance $\bb{x}$ by sampling $\bb{z}$ from the multivariate normal distribution, obtaining the instance-space mean $\Psi _{\bb{\beta}}(\bb{z})$ using our decoder network, and then sampling $\bb{x}$ from $\mathcal{N}( \Psi _{\bb{\beta}}(\bb{z}) , \sigma^2 \bb{I} )$.
Our variational encoder will approximate the posterior with a parametric distribution $q _{\bb{\alpha}}(\bb{Z} | \bb{x}) = \mathcal{N}( \bb{\mu} _{\bb{\alpha}}(\bb{x}), \mathrm{diag}\{ \bb{\sigma}^2_{\bb{\alpha}}(\bb{x}) \} )$. The interpretation is that our encoder model, $\Phi_{\vec{\alpha}}(\bb{x})$, calculates the mean and variance of the posterior distribution, and samples $\bb{z}$ based on them. An important nuance here is that our network can't contain any stochastic elements that depend on the model parameters, otherwise we won't be able to back-propagate to those parameters. So sampling $\bb{z}$ from $\mathcal{N}( \bb{\mu} _{\bb{\alpha}}(\bb{x}), \mathrm{diag}\{ \bb{\sigma}^2_{\bb{\alpha}}(\bb{x}) \} )$ is not an option. The solution is to use what's known as the reparametrization trick: sample from an isotropic Gaussian, i.e. $\bb{u}\sim\mathcal{N}(\bb{0},\bb{I})$ (which doesn't depend on trainable parameters), and calculate the latent representation as $\bb{z} = \bb{\mu} _{\bb{\alpha}}(\bb{x}) + \bb{u}\odot\bb{\sigma}_{\bb{\alpha}}(\bb{x})$.
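The trick can be sketched in a few lines of PyTorch. The function name `reparametrize` is hypothetical, and taking the log-variance as input (rather than the standard deviation) is an assumption borrowed from the encoder formulation used later in this part.

```python
import torch


def reparametrize(mu, log_sigma2):
    # u carries all the randomness and does not depend on any trainable parameter
    u = torch.randn_like(mu)
    sigma = torch.exp(0.5 * log_sigma2)  # standard deviation from the log-variance
    return mu + u * sigma


mu = torch.tensor([[0.5, -1.0]], requires_grad=True)
log_sigma2 = torch.zeros(1, 2, requires_grad=True)

z = reparametrize(mu, log_sigma2)
z.sum().backward()

# Gradients reach both posterior parameters even though z is random
print(mu.grad)          # tensor([[1., 1.]]), since dz/dmu = 1
print(log_sigma2.grad)  # equals 0.5 * u * sigma, so it varies from run to run
```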
To train a VAE model, we maximize the evidence, $p(\bb{X})$ (see question below). The VAE loss can therefore be stated as minimizing $\mathcal{L} = -\mathbb{E}_{\bb{x}} \log p(\bb{X})$. Although this expectation is intractable, we can obtain a lower bound for $\log p(\bb{X})$ (the evidence lower bound, "ELBO", shown in the lecture):
$$ \log p(\bb{X}) \ge \mathbb{E} _{\bb{z} \sim q _{\bb{\alpha}} }\left[ \log p _{\bb{\beta}}(\bb{X} | \bb{z}) \right] - \mathcal{D} _{\mathrm{KL}}\left(q _{\bb{\alpha}}(\bb{Z} | \bb{X})\,\left\|\, p(\bb{Z} )\right.\right) $$
where $ \mathcal{D} _{\mathrm{KL}}(q\left\|\right.p) = \mathbb{E}_{\bb{z}\sim q}\left[ \log \frac{q(\bb{Z})}{p(\bb{Z})} \right] $ is the Kullback-Leibler divergence, which can be interpreted as the information gained by using the posterior $q(\bb{Z|X})$ instead of the prior distribution $p(\bb{Z})$.
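For reference, the bound follows from a standard decomposition of the log-evidence. It is sketched here for a single instance $\bb{x}$ in the notation above, with $p(\bb{Z}|\bb{x})$ denoting the true (intractable) posterior:

$$
\begin{align}
\log p(\bb{x})
&= \mathbb{E} _{\bb{z} \sim q _{\bb{\alpha}}}\left[ \log \frac{p _{\bb{\beta}}(\bb{x}|\bb{z})\, p(\bb{z})}{p(\bb{z}|\bb{x})} \right] \\
&= \mathbb{E} _{\bb{z} \sim q _{\bb{\alpha}}}\left[ \log p _{\bb{\beta}}(\bb{x}|\bb{z}) \right]
 - \mathcal{D} _{\mathrm{KL}}\left(q _{\bb{\alpha}}(\bb{Z}|\bb{x})\,\left\|\, p(\bb{Z})\right.\right)
 + \mathcal{D} _{\mathrm{KL}}\left(q _{\bb{\alpha}}(\bb{Z}|\bb{x})\,\left\|\, p(\bb{Z}|\bb{x})\right.\right).
\end{align}
$$

The first line holds because $p(\bb{x}) = p _{\bb{\beta}}(\bb{x}|\bb{z})\,p(\bb{z}) / p(\bb{z}|\bb{x})$ for every $\bb{z}$. The second follows by multiplying and dividing by $q _{\bb{\alpha}}(\bb{z}|\bb{x})$ inside the logarithm. Since a KL divergence is never negative, dropping the last term gives the lower bound, which is tight exactly when the approximate posterior equals the true one.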
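Using the ELBO, the VAE loss becomes, $$ \mathcal{L}(\vec{\alpha},\vec{\beta}) = \mathbb{E} _{\bb{x}} \left[ \mathbb{E} _{\bb{z} \sim q _{\bb{\alpha}} }\left[ -\log p _{\bb{\beta}}(\bb{x} | \bb{z}) \right] + \mathcal{D} _{\mathrm{KL}}\left(q _{\bb{\alpha}}(\bb{Z} | \bb{x})\,\left\|\, p(\bb{Z} )\right.\right) \right]. $$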
By remembering that the likelihood is a Gaussian distribution with a diagonal covariance and by applying the reparametrization trick, we can write the above as
$$ \mathcal{L}(\vec{\alpha},\vec{\beta}) = \mathbb{E} _{\bb{x}} \left[ \mathbb{E} _{\bb{z} \sim q _{\bb{\alpha}} } \left[ \frac{1}{2\sigma^2}\left\| \bb{x}- \Psi _{\bb{\beta}}\left( \bb{\mu} _{\bb{\alpha}}(\bb{x}) + \bb{\Sigma}^{\frac{1}{2}} _{\bb{\alpha}}(\bb{x}) \bb{u} \right) \right\| _2^2 \right] + \mathcal{D} _{\mathrm{KL}}\left(q _{\bb{\alpha}}(\bb{Z} | \bb{x})\,\left\|\, p(\bb{Z} )\right.\right) \right]. $$
Obviously our model will have two parts, an encoder and a decoder. Since we're working with images, we'll implement both as deep convolutional networks, where the decoder is a "mirror image" of the encoder implemented with adjoint (AKA transposed) convolutions. Between the encoder CNN and the decoder CNN we'll implement the sampling from the parametric posterior approximator $q_{\bb{\alpha}}(\bb{Z}|\bb{x})$ to make it a VAE model and not just a regular autoencoder (of course, this is not yet enough to create a VAE, since we also need a special loss function which we'll get to later).
First let's implement just the CNN part of the Encoder network (this is not the full $\Phi_{\vec{\alpha}}(\bb{x})$ yet). As usual, it should take an input image and map it to an activation volume of a specified depth. We'll consider this volume as the features we extract from the input image. Later we'll use these to create the latent space representation of the input.
import hw4.autoencoder as autoencoder
in_channels = 3
out_channels = 1024
encoder_cnn = autoencoder.EncoderCNN(in_channels, out_channels).to(device)
print(encoder_cnn)
h = encoder_cnn(x0)
print(h.shape)
test.assertEqual(h.dim(), 4)
test.assertSequenceEqual(h.shape[0:2], (1, out_channels))
EncoderCNN(
(cnn): Sequential(
(0): Conv2d(3, 32, kernel_size=(4, 4), stride=(2, 2))
(1): BatchNorm2d(32, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(2): LeakyReLU(negative_slope=0.2)
(3): Conv2d(32, 64, kernel_size=(4, 4), stride=(2, 2))
(4): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(5): LeakyReLU(negative_slope=0.2)
(6): Conv2d(64, 128, kernel_size=(4, 4), stride=(2, 2))
(7): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(8): LeakyReLU(negative_slope=0.2)
(9): Conv2d(128, 1024, kernel_size=(4, 4), stride=(2, 2))
)
)
torch.Size([1, 1024, 2, 2])
Now let's implement the CNN part of the Decoder.
Again this is not yet the full $\Psi _{\bb{\beta}}(\bb{z})$. It should take an activation volume produced
by your EncoderCNN and output an image of the same dimensions as the Encoder's input was.
This can be a CNN which is like a "mirror image" of the Encoder. For example, replace convolutions with transposed convolutions, downsampling with up-sampling etc.
Consult the documentation of nn.ConvTranspose2d
to figure out how to reverse your convolutional layers in terms of input and output dimensions. Note that the decoder doesn't have to be exactly the opposite of the encoder and you can experiment with using a different architecture.
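When reversing layers it helps to track spatial sizes explicitly. The sketch below uses the output-size formulas for convolutions and transposed convolutions, assuming a dilation of 1, and applies them to the kernel, stride and padding values that appear in the example printouts of this notebook. The helper names are hypothetical.

```python
def conv_out(size, kernel, stride=1, padding=0):
    """Spatial output size of a convolution (dilation assumed to be 1)."""
    return (size + 2 * padding - kernel) // stride + 1


def conv_transpose_out(size, kernel, stride=1, padding=0, output_padding=0):
    """Spatial output size of a transposed convolution (dilation assumed to be 1)."""
    return (size - 1) * stride - 2 * padding + kernel + output_padding


# Encoder from the example output: four convolutions with kernel 4, stride 2, no padding
down = [64]
for _ in range(4):
    down.append(conv_out(down[-1], kernel=4, stride=2))
print(down)  # [64, 31, 14, 6, 2]

# Decoder from the example output: five transposed convolutions, kernel 4, stride 2, padding 1
up = [2]
for _ in range(5):
    up.append(conv_transpose_out(up[-1], kernel=4, stride=2, padding=1))
print(up)    # [2, 4, 8, 16, 32, 64]
```

With kernel 4, stride 2 and padding 1, each transposed convolution exactly doubles the spatial size. That is why five such layers can take the 2x2 volume back to 64x64 even though the encoder used only four layers.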
TODO: Implement the DecoderCNN class in the hw4/autoencoder.py module.
decoder_cnn = autoencoder.DecoderCNN(in_channels=out_channels, out_channels=in_channels).to(device)
print(decoder_cnn)
x0r = decoder_cnn(h)
print(x0r.shape)
test.assertEqual(x0.shape, x0r.shape)
# Should look like colored noise
T.functional.to_pil_image(x0r[0].cpu().detach())
DecoderCNN(
(cnn): Sequential(
(0): ConvTranspose2d(1024, 256, kernel_size=(4, 4), stride=(2, 2), padding=(1, 1))
(1): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(2): ReLU()
(3): ConvTranspose2d(256, 128, kernel_size=(4, 4), stride=(2, 2), padding=(1, 1))
(4): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(5): ReLU()
(6): ConvTranspose2d(128, 64, kernel_size=(4, 4), stride=(2, 2), padding=(1, 1))
(7): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(8): ReLU()
(9): ConvTranspose2d(64, 32, kernel_size=(4, 4), stride=(2, 2), padding=(1, 1))
(10): BatchNorm2d(32, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(11): ReLU()
(12): ConvTranspose2d(32, 3, kernel_size=(4, 4), stride=(2, 2), padding=(1, 1))
)
)
torch.Size([1, 3, 64, 64])
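Let's now implement the full VAE Encoder, $\Phi_{\vec{\alpha}}(\vec{x})$. It will work as follows:
Produce a feature vector $\vec{h}$ from the input image $\vec{x}$ using the encoder CNN.
Use two affine transforms to convert the features into the mean and log-variance of the posterior, i.e.
$$
\begin{align}
\bb{\mu} _{\bb{\alpha}}(\bb{x}) &= \vec{h}\mattr{W}_{\mathrm{h\mu}} + \vec{b}_{\mathrm{h\mu}} \\
\log\left(\bb{\sigma}^2_{\bb{\alpha}}(\bb{x})\right) &= \vec{h}\mattr{W}_{\mathrm{h\sigma^2}} + \vec{b}_{\mathrm{h\sigma^2}}
\end{align}
$$
Use the reparametrization trick to create the latent representation $\bb{z}$.
Notice that we model the log of the variance, not the actual variance. The above formulation is proposed in appendix C of the VAE paper.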
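One practical reason to model the log-variance is that a linear layer's output is unconstrained, while exponentiating it always yields a positive variance. The following sketch shows two affine heads applied to a flattened feature volume. The class name `LatentHeadSketch` is hypothetical and the feature shape is taken from the example encoder output above. This is one possible layout, not the required `VAE` implementation.

```python
import torch
import torch.nn as nn


class LatentHeadSketch(nn.Module):
    """Maps encoder features to (z, mu, log_sigma2) using two affine transforms."""

    def __init__(self, n_features, z_dim):
        super().__init__()
        self.mu_layer = nn.Linear(n_features, z_dim)
        self.log_sigma2_layer = nn.Linear(n_features, z_dim)

    def forward(self, h):
        h = h.flatten(start_dim=1)               # (N, C, H, W) -> (N, C*H*W)
        mu = self.mu_layer(h)
        log_sigma2 = self.log_sigma2_layer(h)    # unconstrained; exp() makes it positive
        u = torch.randn_like(mu)                 # parameter-free noise
        z = mu + u * torch.exp(0.5 * log_sigma2)
        return z, mu, log_sigma2


head = LatentHeadSketch(n_features=1024 * 2 * 2, z_dim=2)
z, mu, log_sigma2 = head(torch.randn(1, 1024, 2, 2))
print(z.shape, mu.shape, log_sigma2.shape)
```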
TODO: Implement the encode() method in the VAE class within the hw4/autoencoder.py module.
You'll also need to define your parameters in __init__().
z_dim = 2
vae = autoencoder.VAE(encoder_cnn, decoder_cnn, x0[0].size(), z_dim).to(device)
print(vae)
z, mu, log_sigma2 = vae.encode(x0)
test.assertSequenceEqual(z.shape, (1, z_dim))
test.assertTrue(z.shape == mu.shape == log_sigma2.shape)
print(f'mu(x0)={list(*mu.detach().cpu().numpy())}, sigma2(x0)={list(*torch.exp(log_sigma2).detach().cpu().numpy())}')
VAE(
(features_encoder): EncoderCNN(
(cnn): Sequential(
(0): Conv2d(3, 32, kernel_size=(4, 4), stride=(2, 2))
(1): BatchNorm2d(32, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(2): LeakyReLU(negative_slope=0.2)
(3): Conv2d(32, 64, kernel_size=(4, 4), stride=(2, 2))
(4): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(5): LeakyReLU(negative_slope=0.2)
(6): Conv2d(64, 128, kernel_size=(4, 4), stride=(2, 2))
(7): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(8): LeakyReLU(negative_slope=0.2)
(9): Conv2d(128, 1024, kernel_size=(4, 4), stride=(2, 2))
)
)
(features_decoder): DecoderCNN(
(cnn): Sequential(
(0): ConvTranspose2d(1024, 256, kernel_size=(4, 4), stride=(2, 2), padding=(1, 1))
(1): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(2): ReLU()
(3): ConvTranspose2d(256, 128, kernel_size=(4, 4), stride=(2, 2), padding=(1, 1))
(4): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(5): ReLU()
(6): ConvTranspose2d(128, 64, kernel_size=(4, 4), stride=(2, 2), padding=(1, 1))
(7): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(8): ReLU()
(9): ConvTranspose2d(64, 32, kernel_size=(4, 4), stride=(2, 2), padding=(1, 1))
(10): BatchNorm2d(32, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(11): ReLU()
(12): ConvTranspose2d(32, 3, kernel_size=(4, 4), stride=(2, 2), padding=(1, 1))
)
)
(mu_layer): Sequential(
(0): Linear(in_features=4096, out_features=2, bias=True)
)
(log_sigma2_layer): Sequential(
(0): Linear(in_features=4096, out_features=2, bias=True)
)
(decoder_in): Sequential(
(0): Linear(in_features=2, out_features=4096, bias=True)
)
)
mu(x0)=[-0.4129818, 0.2801996], sigma2(x0)=[1.0662273, 0.9347149]
Let's sample some 2d latent representations for an input image x0 and visualize them.
# Sample from q(Z|x)
N = 500
Z = torch.zeros(N, z_dim)
_, ax = plt.subplots()
with torch.no_grad():
    for i in range(N):
        Z[i], _, _ = vae.encode(x0)
        ax.scatter(*Z[i].cpu().numpy())
# Should be close to the mu/sigma in the previous block above
print('sampled mu', torch.mean(Z, dim=0))
print('sampled sigma2', torch.var(Z, dim=0))
sampled mu tensor([-0.4850, 0.3184])
sampled sigma2 tensor([0.8914, 0.8104])
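Let's now implement the full VAE Decoder, $\Psi _{\bb{\beta}}(\bb{z})$. It will work as follows:
Given a latent representation $\bb{z}$, apply an affine transform to obtain a feature vector and reshape it into an activation volume of the shape the decoder CNN expects.
Apply the decoder CNN to this volume to reconstruct the image.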
TODO: Implement the decode() method in the VAE class within the hw4/autoencoder.py module.
You'll also need to define your parameters in __init__(). You may need to also re-run the block above after you implement this.
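The decoding path can be sketched as a projection, a reshape, and a CNN. In the example below the feature shape is taken from the example encoder output, and `toy_decoder_cnn` is a single-layer stand-in used only to keep the example small. Both names, as well as `decode_sketch`, are hypothetical.

```python
import torch
import torch.nn as nn

z_dim = 2
features_shape = (1024, 2, 2)   # (C, H, W) of the example encoder's output volume
n_features = 1024 * 2 * 2

# Affine transform from the latent space back to the feature space
decoder_in = nn.Linear(z_dim, n_features)
# Stand-in for the real decoder CNN, for illustration only
toy_decoder_cnn = nn.ConvTranspose2d(1024, 3, kernel_size=4, stride=2, padding=1)


def decode_sketch(z):
    h = decoder_in(z)                    # (N, n_features)
    h = h.reshape(-1, *features_shape)   # (N, 1024, 2, 2)
    return toy_decoder_cnn(h)            # (N, 3, 4, 4) with this toy stand-in


x_rec = decode_sketch(torch.randn(5, z_dim))
print(x_rec.shape)  # torch.Size([5, 3, 4, 4])
```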
x0r = vae.decode(z)
test.assertSequenceEqual(x0r.shape, x0.shape)
Our model's forward() function will simply return decode(encode(x)) as well as the calculated mean and log-variance of the posterior.
x0r, mu, log_sigma2 = vae(x0)
test.assertSequenceEqual(x0r.shape, x0.shape)
test.assertSequenceEqual(mu.shape, (1, z_dim))
test.assertSequenceEqual(log_sigma2.shape, (1, z_dim))
T.functional.to_pil_image(x0r[0].detach().cpu())
In practice, since we're using SGD, we'll drop the expectation over $\bb{X}$ and instead sample an instance from the training set and compute a point-wise loss. Similarly, we'll drop the expectation over $\bb{Z}$ by sampling from $q_{\vec{\alpha}}(\bb{Z}|\bb{x})$. Additionally, because the KL divergence is between two Gaussian distributions, there is a closed-form expression for it. These points bring us to the following point-wise loss:
$$ \ell(\vec{\alpha},\vec{\beta};\bb{x}) = \frac{1}{\sigma^2 d_x} \left\| \bb{x}- \Psi _{\bb{\beta}}\left( \bb{\mu} _{\bb{\alpha}}(\bb{x}) + \bb{\Sigma}^{\frac{1}{2}} _{\bb{\alpha}}(\bb{x}) \bb{u} \right) \right\| _2^2 + \mathrm{tr}\,\bb{\Sigma} _{\bb{\alpha}}(\bb{x}) + \|\bb{\mu} _{\bb{\alpha}}(\bb{x})\|^2 _2 - d_z - \log\det \bb{\Sigma} _{\bb{\alpha}}(\bb{x}), $$
where $d_z$ is the dimension of the latent space, $d_x$ is the dimension of the input and $\bb{u}\sim\mathcal{N}(\bb{0},\bb{I})$. This pointwise loss is the quantity that we'll compute and minimize with gradient descent. The first term corresponds to the data-reconstruction loss, while the remaining terms together form the closed-form KL-divergence loss between the Gaussian posterior and the standard normal prior. Note that the scaling by $d_x$ is not derived from the original loss formula and was added directly to the pointwise loss just to normalize the data term. The constant factor of $\frac{1}{2}$ shared by the data term and the KL term has also been dropped.
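As a sanity check on the formula, the sketch below evaluates both terms for a batch with a diagonal posterior covariance. In that case the trace is the sum of the variances and the log-determinant is the sum of the log-variances. The function name and its return layout are hypothetical, and averaging over the batch is an assumption. The KL part is compared against `torch.distributions`, which computes the divergence including the factor of 1/2 that the pointwise loss omits.

```python
import torch
from torch.distributions import Normal, kl_divergence


def vae_pointwise_loss_sketch(x, xr, z_mu, z_log_sigma2, x_sigma2):
    n = x.shape[0]
    d_x = x[0].numel()       # dimension of a single input instance
    d_z = z_mu.shape[1]      # dimension of the latent space
    # Data term: squared reconstruction error, scaled by 1 / (sigma^2 * d_x)
    data_term = ((x - xr) ** 2).reshape(n, -1).sum(dim=1) / (x_sigma2 * d_x)
    # KL term for a diagonal covariance: tr(Sigma) + ||mu||^2 - d_z - log det(Sigma)
    kl_term = (z_log_sigma2.exp().sum(dim=1) + (z_mu ** 2).sum(dim=1)
               - d_z - z_log_sigma2.sum(dim=1))
    return (data_term + kl_term).mean(), data_term.mean(), kl_term.mean()


torch.manual_seed(0)
x, xr = torch.randn(4, 3, 8, 8), torch.randn(4, 3, 8, 8)
mu, log_sigma2 = torch.randn(4, 5), torch.randn(4, 5)
loss, data_loss, kl_loss = vae_pointwise_loss_sketch(x, xr, mu, log_sigma2, x_sigma2=0.9)

# Reference KL(q || N(0, I)), summed over latent dimensions and averaged over the batch
q = Normal(mu, torch.exp(0.5 * log_sigma2))
p = Normal(torch.zeros_like(mu), torch.ones_like(mu))
ref_kl = kl_divergence(q, p).sum(dim=1).mean()
print(kl_loss.item(), 2 * ref_kl.item())  # the two values should agree
```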
TODO: Implement the vae_loss() function in the hw4/autoencoder.py module.
from hw4.autoencoder import vae_loss
torch.manual_seed(42)
def test_vae_loss():
    # Test data
    N, C, H, W = 10, 3, 64, 64
    z_dim = 32
    x = torch.randn(N, C, H, W)*2 - 1
    xr = torch.randn(N, C, H, W)*2 - 1
    z_mu = torch.randn(N, z_dim)
    z_log_sigma2 = torch.randn(N, z_dim)
    x_sigma2 = 0.9
    loss, _, _ = vae_loss(x, xr, z_mu, z_log_sigma2, x_sigma2)
    test.assertAlmostEqual(loss.item(), 58.3234367, delta=1e-3)
    return loss
test_vae_loss()
tensor(58.3234)
The main advantage of a VAE is that it can be used as a generative model by sampling the latent space, since we optimize for an isotropic Gaussian prior $p(\bb{Z})$ in the loss function. Let's now implement this so that we can visualize how our model is doing when we train.
TODO: Implement the sample() method in the VAE class within the hw4/autoencoder.py module.
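The generative process described earlier (draw $\bb{z}$ from the prior, then decode) can be sketched as below. The function `sample_sketch` and the toy decoder are hypothetical. Returning the decoder mean by default, rather than also adding the $\sigma^2$ instance-space noise, is an assumption about what is convenient to visualize.

```python
import torch


def sample_sketch(decode_fn, n, z_dim, x_sigma2=None):
    """Draw n instances: z ~ N(0, I), then decode to the instance space."""
    with torch.no_grad():                     # sampling needs no gradients
        z = torch.randn(n, z_dim)             # latent prior p(Z) = N(0, I)
        x_mean = decode_fn(z)                 # instance-space mean Psi(z)
        if x_sigma2 is None:
            return x_mean
        # Optionally sample x ~ N(Psi(z), sigma^2 I)
        return x_mean + torch.randn_like(x_mean) * x_sigma2 ** 0.5


# Toy decoder standing in for a trained model, for illustration only
toy_linear = torch.nn.Linear(2, 12)
toy_decode = lambda z: toy_linear(z).reshape(-1, 3, 2, 2)

samples = sample_sketch(toy_decode, n=5, z_dim=2)
noisy_samples = sample_sketch(toy_decode, n=5, z_dim=2, x_sigma2=0.01)
print(samples.shape, noisy_samples.shape)
```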
samples = vae.sample(5)
_ = plot.tensors_as_images(samples)
Time to train!
TODO:
Implement the VAETrainer class in the hw4/training.py module. Make sure to implement the checkpoints feature of the Trainer class if you haven't done so already in Part 1.
Set the hyperparameters in the part2_vae_hyperparams() function within the hw4/answers.py module.
import torch.optim as optim
from torch.utils.data import random_split
from torch.utils.data import DataLoader
from torch.nn import DataParallel
from hw4.training import VAETrainer
from hw4.answers import part2_vae_hyperparams
torch.manual_seed(42)
# Hyperparams
hp = part2_vae_hyperparams()
batch_size = hp['batch_size']
h_dim = hp['h_dim']
z_dim = hp['z_dim']
x_sigma2 = hp['x_sigma2']
learn_rate = hp['learn_rate']
betas = hp['betas']
# Data
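n_train = int(len(ds_gwb)*0.9)
# The two lengths must sum to len(ds_gwb), so derive the test size from the remainder
split_lengths = [n_train, len(ds_gwb) - n_train]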
ds_train, ds_test = random_split(ds_gwb, split_lengths)
dl_train = DataLoader(ds_train, batch_size, shuffle=True)
dl_test = DataLoader(ds_test, batch_size, shuffle=True)
im_size = ds_train[0][0].shape
# Model
encoder = autoencoder.EncoderCNN(in_channels=im_size[0], out_channels=h_dim)
decoder = autoencoder.DecoderCNN(in_channels=h_dim, out_channels=im_size[0])
vae = autoencoder.VAE(encoder, decoder, im_size, z_dim)
vae_dp = DataParallel(vae).to(device)
# Optimizer
optimizer = optim.Adam(vae.parameters(), lr=learn_rate, betas=betas)
# Loss
def loss_fn(x, xr, z_mu, z_log_sigma2):
    return autoencoder.vae_loss(x, xr, z_mu, z_log_sigma2, x_sigma2)
# Trainer
trainer = VAETrainer(vae_dp, loss_fn, optimizer, device)
checkpoint_file = 'checkpoints/vae'
checkpoint_file_final = f'{checkpoint_file}_final'
if os.path.isfile(f'{checkpoint_file}.pt'):
    os.remove(f'{checkpoint_file}.pt')
# Show model and hypers
print(vae)
print(hp)
VAE(
(features_encoder): EncoderCNN(
(cnn): Sequential(
(0): Conv2d(3, 32, kernel_size=(4, 4), stride=(2, 2))
(1): BatchNorm2d(32, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(2): LeakyReLU(negative_slope=0.2)
(3): Conv2d(32, 64, kernel_size=(4, 4), stride=(2, 2))
(4): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(5): LeakyReLU(negative_slope=0.2)
(6): Conv2d(64, 128, kernel_size=(4, 4), stride=(2, 2))
(7): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(8): LeakyReLU(negative_slope=0.2)
(9): Conv2d(128, 512, kernel_size=(4, 4), stride=(2, 2))
)
)
(features_decoder): DecoderCNN(
(cnn): Sequential(
(0): ConvTranspose2d(512, 256, kernel_size=(4, 4), stride=(2, 2), padding=(1, 1))
(1): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(2): ReLU()
(3): ConvTranspose2d(256, 128, kernel_size=(4, 4), stride=(2, 2), padding=(1, 1))
(4): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(5): ReLU()
(6): ConvTranspose2d(128, 64, kernel_size=(4, 4), stride=(2, 2), padding=(1, 1))
(7): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(8): ReLU()
(9): ConvTranspose2d(64, 32, kernel_size=(4, 4), stride=(2, 2), padding=(1, 1))
(10): BatchNorm2d(32, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(11): ReLU()
(12): ConvTranspose2d(32, 3, kernel_size=(4, 4), stride=(2, 2), padding=(1, 1))
)
)
(mu_layer): Sequential(
(0): Linear(in_features=2048, out_features=128, bias=True)
)
(log_sigma2_layer): Sequential(
(0): Linear(in_features=2048, out_features=128, bias=True)
)
(decoder_in): Sequential(
(0): Linear(in_features=128, out_features=2048, bias=True)
)
)
{'batch_size': 32, 'h_dim': 512, 'z_dim': 128, 'x_sigma2': 0.0001, 'learn_rate': 0.001, 'betas': (0.5, 0.55)}
TODO:
When you are satisfied with the results, rename the checkpoints file by adding the suffix _final. When you run the main.py script to generate your submission, the final checkpoints file will be loaded instead of running training. Note that your final submission zip will not include the checkpoints/ folder. This is OK.
The images you get should be colorful, with different backgrounds and poses.
import IPython.display
def post_epoch_fn(epoch, train_result, test_result, verbose):
    # Plot some samples if this is a verbose epoch
    if verbose:
        samples = vae.sample(n=5)
        fig, _ = plot.tensors_as_images(samples, figsize=(6,2))
        IPython.display.display(fig)
        plt.close(fig)

if os.path.isfile(f'{checkpoint_file_final}.pt'):
    print(f'*** Loading final checkpoint file {checkpoint_file_final} instead of training')
    checkpoint_file = checkpoint_file_final
else:
    res = trainer.fit(dl_train, dl_test,
                      num_epochs=200, early_stopping=20, print_every=10,
                      checkpoints=checkpoint_file,
                      post_epoch_fn=post_epoch_fn)
# Plot images from best model
saved_state = torch.load(f'{checkpoint_file}.pt', map_location=device)
vae_dp.load_state_dict(saved_state['model_state'])
print('*** Images Generated from best model:')
fig, _ = plot.tensors_as_images(vae_dp.module.sample(n=15), nrows=3, figsize=(6,6))
--- EPOCH 1/200 ---
*** Saved checkpoint checkpoint
*** Saved checkpoint checkpoint
*** Saved checkpoint checkpoint
*** Saved checkpoint checkpoint
*** Saved checkpoint checkpoint
*** Saved checkpoint checkpoint
*** Saved checkpoint checkpoint
*** Saved checkpoint checkpoint
*** Saved checkpoint checkpoint
--- EPOCH 11/200 ---
*** Saved checkpoint checkpoint
*** Saved checkpoint checkpoint
*** Saved checkpoint checkpoint
*** Saved checkpoint checkpoint
*** Saved checkpoint checkpoint
*** Saved checkpoint checkpoint
*** Saved checkpoint checkpoint
--- EPOCH 21/200 ---
*** Saved checkpoint checkpoint
*** Saved checkpoint checkpoint
--- EPOCH 31/200 ---
*** Saved checkpoint checkpoint
*** Saved checkpoint checkpoint
*** Saved checkpoint checkpoint
--- EPOCH 41/200 ---
*** Saved checkpoint checkpoint
--- EPOCH 51/200 ---
*** Saved checkpoint checkpoint
--- EPOCH 61/200 ---
--- EPOCH 71/200 ---
--- EPOCH 81/200 ---
--- EPOCH 91/200 ---
--- EPOCH 101/200 ---
--- EPOCH 111/200 ---
--- EPOCH 121/200 ---
--- EPOCH 131/200 ---
--- EPOCH 141/200 ---
--- EPOCH 151/200 ---
--- EPOCH 161/200 ---
--- EPOCH 171/200 ---
--- EPOCH 181/200 ---
--- EPOCH 191/200 ---
--- EPOCH 200/200 ---
---------------------------------------------------------------------------
FileNotFoundError                         Traceback (most recent call last)
Input In [16], in <module>
     15     res = trainer.fit(dl_train, dl_test,
     16                       num_epochs=200, early_stopping=20, print_every=10,
     17                       checkpoints=checkpoint_file,
     18                       post_epoch_fn=post_epoch_fn)
     20 # Plot images from best model
---> 21 saved_state = torch.load(f'{checkpoint_file}.pt', map_location=device)
     22 vae_dp.load_state_dict(saved_state['model_state'])
     23 print('*** Images Generated from best model:')

File ~/miniconda3/envs/cs236781-hw4/lib/python3.8/site-packages/torch/serialization.py:594, in load(f, map_location, pickle_module, **pickle_load_args)
    591 if 'encoding' not in pickle_load_args.keys():
    592     pickle_load_args['encoding'] = 'utf-8'
--> 594 with _open_file_like(f, 'rb') as opened_file:
    595     if _is_zipfile(opened_file):
    596         # The zipfile reader is going to advance the current file position.
    597         # If we want to actually tail call to torch.jit.load, we need to
    598         # reset back to the original position.
    599         orig_position = opened_file.tell()

File ~/miniconda3/envs/cs236781-hw4/lib/python3.8/site-packages/torch/serialization.py:230, in _open_file_like(name_or_buffer, mode)
    228 def _open_file_like(name_or_buffer, mode):
    229     if _is_path(name_or_buffer):
--> 230         return _open_file(name_or_buffer, mode)
    231     else:
    232         if 'w' in mode:

File ~/miniconda3/envs/cs236781-hw4/lib/python3.8/site-packages/torch/serialization.py:211, in _open_file.__init__(self, name, mode)
    210 def __init__(self, name, mode):
--> 211     super(_open_file, self).__init__(open(name, mode))

FileNotFoundError: [Errno 2] No such file or directory: 'checkpoints/vae.pt'
TODO Answer the following questions. Write your answers in the appropriate variables in the module hw4/answers.py.
from cs236781.answers import display_answer
import hw4.answers as answers
What does the $\sigma^2$ hyperparameter (x_sigma2 in the code) do? Explain the effect of low and high values.
display_answer(answers.part2_q1)
Your answer:
$\sigma^2$ is the variance of the likelihood. It scales the data-reconstruction term of the total loss, and so sets the balance between that term and the regularization term. High values reduce the weight of the reconstruction term, so the regularization term dominates: the latent posterior is pulled towards the prior and reconstructions become less faithful to the input. Low values do the opposite: the reconstruction term dominates, so outputs stay close to the inputs, at the cost of weaker regularization of the latent space.
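The effect of $\sigma^2$ on the balance between the two terms can be illustrated with a minimal sketch. The function name `vae_loss_sketch`, the per-pixel averaging, and the factor of 0.5 in the KL term are assumptions made for illustration; the assignment's own loss may be normalized differently.

```python
import torch

def vae_loss_sketch(x, xr, z_mu, z_log_sigma2, x_sigma2):
    """Hypothetical VAE loss: reconstruction term scaled by 1/x_sigma2, plus a KL term."""
    # Mean squared reconstruction error per sample, scaled by 1 / sigma^2.
    data_loss = ((x - xr) ** 2).flatten(1).mean(dim=1) / x_sigma2
    # KL divergence between N(mu, diag(sigma^2)) and N(0, I), per sample.
    kl_loss = 0.5 * (z_log_sigma2.exp() + z_mu ** 2 - 1 - z_log_sigma2).sum(dim=1)
    return (data_loss + kl_loss).mean(), data_loss.mean(), kl_loss.mean()

torch.manual_seed(0)
x, xr = torch.rand(4, 3, 8, 8), torch.rand(4, 3, 8, 8)
mu, log_s2 = torch.randn(4, 16), torch.randn(4, 16)
for s2 in (1e-4, 1.0):
    _, data, kl = vae_loss_sketch(x, xr, mu, log_s2, x_sigma2=s2)
    print(f'x_sigma2={s2}: data term={data.item():.3f}, KL term={kl.item():.3f}')
```

The KL term is identical in both runs, while the data term grows by a factor of $1/\sigma^2$, which is why a small `x_sigma2` makes reconstruction dominate the gradient.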
display_answer(answers.part2_q2)
Your answer:
In the formulation of the VAE loss, why do we start by maximizing the evidence distribution, $p(\bb{X})$?
display_answer(answers.part2_q3)
Your answer:
By maximizing the evidence distribution we give our model the ability to generate data from the latent space whose distribution matches that of the data in the instance space.
In the VAE encoder, why do we model the log of the latent-space variance corresponding to an input, $\bb{\sigma}^2_{\bb{\alpha}}$, instead of directly modelling this variance?
display_answer(answers.part2_q4)
Your answer:
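The variance must always be positive, while the output of the encoder's linear layer is unconstrained. By modelling the log of the variance, the network can output any real number, and exponentiating it always yields a valid positive variance. This widens the range of values the layer can use, and it is also more stable numerically when the variance is very small.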
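A short sketch of this idea, assuming the encoder outputs a mean and a log-variance per latent dimension. The tensors below are hypothetical values standing in for the outputs of `mu_layer` and `log_sigma2_layer`.

```python
import torch

torch.manual_seed(0)
# Unconstrained values, as a linear layer could produce (hypothetical numbers).
log_sigma2 = torch.tensor([-20.0, -1.0, 0.0, 3.0])
mu = torch.zeros(4)

sigma2 = torch.exp(log_sigma2)         # strictly positive for any real input
std = torch.exp(0.5 * log_sigma2)      # sqrt(sigma2), computed directly from the log
z = mu + std * torch.randn_like(std)   # reparameterized sample z ~ N(mu, sigma2)
print(sigma2)
```

Even a strongly negative output such as -20 maps to a tiny but valid variance, so no clamping or absolute value is needed.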
In this part we will implement and train a generative adversarial network and apply it to the task of image generation.
import unittest
import os
import sys
import pathlib
import urllib
import shutil
import re
import zipfile
import numpy as np
import torch
import matplotlib.pyplot as plt
%load_ext autoreload
%autoreload 2
test = unittest.TestCase()
plt.rcParams.update({'font.size': 12})
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print('Using device:', device)
Using device: cpu
We'll use the same data as in Part 2.
But again, you can use a custom dataset, by editing the PART3_CUSTOM_DATA_URL variable in hw4/answers.py.
import cs236781.plot as plot
import cs236781.download
from hw4.answers import PART3_CUSTOM_DATA_URL as CUSTOM_DATA_URL
DATA_DIR = pathlib.Path.home().joinpath('.pytorch-datasets')
if CUSTOM_DATA_URL is None:
    DATA_URL = 'http://vis-www.cs.umass.edu/lfw/lfw-bush.zip'
else:
    DATA_URL = CUSTOM_DATA_URL
_, dataset_dir = cs236781.download.download_data(out_path=DATA_DIR, url=DATA_URL, extract=True, force=False)
File /Users/romy/.pytorch-datasets/lfw-bush.zip exists, skipping download.
Extracting /Users/romy/.pytorch-datasets/lfw-bush.zip...
Extracted 531 to /Users/romy/.pytorch-datasets/lfw/George_W_Bush
Create a Dataset object that will load the extracted images:
import torchvision.transforms as T
from torchvision.datasets import ImageFolder
im_size = 64
tf = T.Compose([
# Resize to constant spatial dimensions
T.Resize((im_size, im_size)),
# PIL.Image -> torch.Tensor
T.ToTensor(),
# Dynamic range [0,1] -> [-1, 1]
T.Normalize(mean=(.5,.5,.5), std=(.5,.5,.5)),
])
ds_gwb = ImageFolder(os.path.dirname(dataset_dir), tf)
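The normalization step above maps pixel values from $[0,1]$ to $[-1,1]$. The sketch below reproduces that arithmetic on a synthetic tensor with plain `torch` operations (an assumption made to keep it self-contained, rather than calling the transform itself), and shows the inverse mapping that is handy before plotting.

```python
import torch

torch.manual_seed(0)
mean, std = 0.5, 0.5
x01 = torch.rand(3, 64, 64)    # stand-in for a ToTensor() output in [0, 1]
x = (x01 - mean) / std         # per-channel normalization with mean=.5, std=.5
x_back = x * std + mean        # inverse mapping back to [0, 1]
print(x.min().item(), x.max().item())
```

A $[-1,1]$ range for the data matches a generator whose last layer is a `Tanh`.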
OK, let's see what we got. You can run the following block multiple times to display a random subset of images from the dataset.
_ = plot.dataset_first_n(ds_gwb, 50, figsize=(15,10), nrows=5)
print(f'Found {len(ds_gwb)} images in dataset folder.')
Found 530 images in dataset folder.
x0, y0 = ds_gwb[0]
x0 = x0.unsqueeze(0).to(device)
print(x0.shape)
test.assertSequenceEqual(x0.shape, (1, 3, im_size, im_size))
torch.Size([1, 3, 64, 64])
GANs, first proposed in a 2014 paper by Ian Goodfellow et al., are today arguably the most popular type of generative model. GANs currently produce state-of-the-art results in generative tasks across many different domains.
In a GAN model, two different neural networks compete against each other: A generator and a discriminator.
The Generator, which we'll denote as $\Psi _{\bb{\gamma}} : \mathcal{Z} \rightarrow \mathcal{X}$, maps a latent-space variable $\bb{z}\sim\mathcal{N}(\bb{0},\bb{I})$ to an instance-space variable $\bb{x}$ (e.g. an image). Thus a parametric evidence distribution $p_{\bb{\gamma}}(\bb{X})$ is generated, which we typically would like to be as close as possible to the real evidence distribution, $p(\bb{X})$.
The Discriminator, $\Delta _{\bb{\delta}} : \mathcal{X} \rightarrow [0,1]$, is a network which, given an instance-space variable $\bb{x}$, returns the probability that $\bb{x}$ is real, i.e. that $\bb{x}$ was sampled from $p(\bb{X})$ and not $p_{\bb{\gamma}}(\bb{X})$.

The generator is trained to generate "fake" instances which will maximally fool the discriminator into returning that they're real. Mathematically, the generator's parameters $\bb{\gamma}$ should be chosen such as to maximize the expression $$ \mathbb{E} _{\bb{z} \sim p(\bb{Z}) } \log (\Delta _{\bb{\delta}}(\Psi _{\bb{\gamma}} (\bb{z}) )). $$
The discriminator is trained to classify between real images, coming from the training set, and fake images generated by the generator. Mathematically, the discriminator's parameters $\bb{\delta}$ should be chosen such as to maximize the expression $$ \mathbb{E} _{\bb{x} \sim p(\bb{X}) } \log \Delta _{\bb{\delta}}(\bb{x}) \, + \, \mathbb{E} _{\bb{z} \sim p(\bb{Z}) } \log (1-\Delta _{\bb{\delta}}(\Psi _{\bb{\gamma}} (\bb{z}) )). $$
These two competing objectives can thus be expressed as the following min-max optimization: $$ \min _{\bb{\gamma}} \max _{\bb{\delta}} \, \mathbb{E} _{\bb{x} \sim p(\bb{X}) } \log \Delta _{\bb{\delta}}(\bb{x}) \, + \, \mathbb{E} _{\bb{z} \sim p(\bb{Z}) } \log (1-\Delta _{\bb{\delta}}(\Psi _{\bb{\gamma}} (\bb{z}) )). $$
A key insight into GANs is that we can interpret the above maximum as the loss with respect to $\bb{\gamma}$:
$$ L({\bb{\gamma}}) = \max _{\bb{\delta}} \, \mathbb{E} _{\bb{x} \sim p(\bb{X}) } \log \Delta _{\bb{\delta}}(\bb{x}) \, + \, \mathbb{E} _{\bb{z} \sim p(\bb{Z}) } \log (1-\Delta _{\bb{\delta}}(\Psi _{\bb{\gamma}} (\bb{z}) )). $$
This means that the generator's loss function trains together with the generator itself in an adversarial manner. In contrast, when training our VAE we used a fixed L2 norm as a data loss term.
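To make the min-max objective concrete, here is a minimal sketch that estimates the expression inside the optimization from a batch of discriminator outputs. The function name `gan_value` and the probabilities are hypothetical, chosen only for illustration.

```python
import torch

def gan_value(d_real, d_fake):
    """Monte-Carlo estimate of E[log D(x)] + E[log(1 - D(G(z)))]."""
    return torch.log(d_real).mean() + torch.log(1 - d_fake).mean()

# Hypothetical discriminator outputs (probabilities that the input is real).
confident = gan_value(torch.tensor([0.9, 0.95]), torch.tensor([0.1, 0.05]))
fooled = gan_value(torch.tensor([0.5, 0.5]), torch.tensor([0.5, 0.5]))
print(confident.item(), fooled.item())
```

A discriminator that separates real from fake pushes the value towards 0, its maximum. When the discriminator outputs 0.5 everywhere, the value is $-2\log 2 \approx -1.386$, which is the situation the generator is trying to bring about.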
We'll now implement a Deep Convolutional GAN (DCGAN) model. See the DCGAN paper for architecture ideas and tips for training.
TODO: Implement the Discriminator class in the hw4/gan.py module.
If you wish you can reuse the EncoderCNN class from the VAE model as the first part of the Discriminator.
import hw4.gan as gan
dsc = gan.Discriminator(in_size=x0[0].shape).to(device)
print(dsc)
d0 = dsc(x0)
print(d0.shape)
test.assertSequenceEqual(d0.shape, (1,1))
Discriminator(
(cnn): Sequential(
(0): Conv2d(3, 4, kernel_size=(4, 4), stride=(2, 2), padding=(1, 1), bias=False)
(1): LeakyReLU(negative_slope=0.2, inplace=True)
(2): Conv2d(4, 8, kernel_size=(4, 4), stride=(2, 2), padding=(1, 1), bias=False)
(3): BatchNorm2d(8, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(4): LeakyReLU(negative_slope=0.2, inplace=True)
(5): Conv2d(8, 16, kernel_size=(4, 4), stride=(2, 2), padding=(1, 1), bias=False)
(6): BatchNorm2d(16, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(7): LeakyReLU(negative_slope=0.2, inplace=True)
(8): Conv2d(16, 1, kernel_size=(4, 4), stride=(1, 1), bias=False)
)
(fc): Sequential(
(0): Linear(in_features=25, out_features=1, bias=True)
)
)
torch.Size([1, 1])
TODO: Implement the Generator class in the hw4/gan.py module.
If you wish you can reuse the DecoderCNN class from the VAE model as the last part of the Generator.
z_dim = 128
gen = gan.Generator(z_dim, 4).to(device)
print(gen)
z = torch.randn(1, z_dim).to(device)
xr = gen(z)
print(xr.shape)
test.assertSequenceEqual(x0.shape, xr.shape)
Generator(
(net): Sequential(
(0): ConvTranspose2d(1024, 512, kernel_size=(4, 4), stride=(2, 2), padding=(1, 1), bias=False)
(1): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(2): ReLU(inplace=True)
(3): ConvTranspose2d(512, 256, kernel_size=(4, 4), stride=(2, 2), padding=(1, 1), bias=False)
(4): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(5): ReLU(inplace=True)
(6): ConvTranspose2d(256, 128, kernel_size=(4, 4), stride=(2, 2), padding=(1, 1), bias=False)
(7): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(8): ReLU(inplace=True)
(9): ConvTranspose2d(128, 3, kernel_size=(4, 4), stride=(2, 2), padding=(1, 1), bias=False)
(10): Tanh()
)
(projection): Sequential(
(0): Linear(in_features=128, out_features=16384, bias=True)
)
)
torch.Size([1, 3, 64, 64])
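Let's begin with the discriminator's loss function. Based on the above we can flip the sign and say we want to update the Discriminator's parameters $\bb{\delta}$ so that they minimize the expression $$ - \mathbb{E} _{\bb{x} \sim p(\bb{X}) } \log \Delta _{\bb{\delta}}(\bb{x}) \, - \, \mathbb{E} _{\bb{z} \sim p(\bb{Z}) } \log (1-\Delta _{\bb{\delta}}(\Psi _{\bb{\gamma}} (\bb{z}) )). $$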
We're using the Discriminator twice in this expression; once to classify data from the real data distribution and once again to classify generated data. Therefore our loss should be computed based on these two terms. Notice that since the discriminator returns a probability, we can formulate the above as two cross-entropy losses.
GANs are notoriously difficult to train. One common trick for improving GAN stability during training is to make the classification labels noisy for the discriminator. This can be seen as a form of regularization, to help prevent the discriminator from overfitting.
We'll incorporate this idea into our loss function. Instead of labels being equal to 0 or 1, we'll make them "fuzzy", i.e. random numbers in the ranges $[0\pm\epsilon]$ and $[1\pm\epsilon]$.
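The two cross-entropy terms and the fuzzy labels can be sketched as follows. This is an illustration under stated assumptions, not the expected solution: the name `noisy_discriminator_loss` is hypothetical, the discriminator is assumed to output raw scores (logits), and `label_noise` is assumed to be the total width of the uniform noise interval around each label. It is not guaranteed to reproduce the value checked in the notebook's test.

```python
import torch
import torch.nn.functional as F

def noisy_discriminator_loss(y_data, y_generated, data_label=1, label_noise=0.0):
    """Hypothetical sketch: BCE-with-logits on real and generated scores, with fuzzy labels."""
    # Uniform noise in [-label_noise/2, +label_noise/2] around each target label.
    noise_data = (torch.rand_like(y_data) - 0.5) * label_noise
    noise_gen = (torch.rand_like(y_generated) - 0.5) * label_noise
    target_data = data_label + noise_data
    target_gen = (1 - data_label) + noise_gen
    # One cross-entropy term per use of the discriminator.
    loss_data = F.binary_cross_entropy_with_logits(y_data, target_data)
    loss_gen = F.binary_cross_entropy_with_logits(y_generated, target_gen)
    return loss_data + loss_gen

torch.manual_seed(0)
scores_real = torch.full((8,), 5.0)    # discriminator is confident these are real
scores_fake = torch.full((8,), -5.0)   # and confident these are fake
print(noisy_discriminator_loss(scores_real, scores_fake, label_noise=0.3).item())
```

Using the logits form of the cross-entropy avoids applying a sigmoid followed by a log, which is usually the more numerically stable choice.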
TODO: Implement the discriminator_loss_fn() function in the hw4/gan.py module.
from hw4.gan import discriminator_loss_fn
torch.manual_seed(42)
y_data = torch.rand(10) * 10
y_generated = torch.rand(10) * 10
loss = discriminator_loss_fn(y_data, y_generated, data_label=1, label_noise=0.3)
print(loss)
test.assertAlmostEqual(loss.item(), 6.4808731, delta=1e-5)
tensor(6.4809)
Similarly, the generator's parameters $\bb{\gamma}$ should minimize the expression $$ -\mathbb{E} _{\bb{z} \sim p(\bb{Z}) } \log (\Delta _{\bb{\delta}}(\Psi _{\bb{\gamma}} (\bb{z}) )) $$
which can also be seen as a cross-entropy term. This corresponds to "fooling" the discriminator. Notice that the gradient of the loss w.r.t. $\bb{\gamma}$ using this expression also depends on $\bb{\delta}$.
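A minimal sketch of this term, again assuming the discriminator returns raw scores (logits). The name `generator_loss_sketch` is hypothetical and the code is not guaranteed to match the value checked in the notebook's test.

```python
import torch
import torch.nn.functional as F

def generator_loss_sketch(y_generated, data_label=1):
    """Hypothetical sketch: the generator wants its samples to be scored as 'real'."""
    # The target for generated samples is the label of real data.
    target = torch.full_like(y_generated, float(data_label))
    return F.binary_cross_entropy_with_logits(y_generated, target)

low = generator_loss_sketch(torch.tensor([4.0, 6.0]))     # scores say "real"
high = generator_loss_sketch(torch.tensor([-4.0, -6.0]))  # scores say "fake"
print(low.item(), high.item())
```

The loss is small when the discriminator is fooled and large when it is not, which is the signal the generator's optimizer follows.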
TODO: Implement the generator_loss_fn() function in the hw4/gan.py module.
from hw4.gan import generator_loss_fn
torch.manual_seed(42)
y_generated = torch.rand(20) * 10
loss = generator_loss_fn(y_generated, data_label=1)
print(loss)
test.assertAlmostEqual(loss.item(), 0.0222969, delta=1e-3)
tensor(0.0223)
Sampling from a GAN is straightforward, since it learns to generate data from an isotropic Gaussian latent space distribution.
There is an important nuance however. Sampling is required during the process of training the GAN, since we generate fake images to show the discriminator. As you'll see in the next section, in some cases we'll need our samples to have gradients (i.e., to be part of the Generator's computation graph).
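The difference between sampling with and without gradients can be sketched on a toy model. `TinyGenerator` below is a hypothetical stand-in for the assignment's Generator, used only to show one possible way of toggling gradient tracking.

```python
import torch
import torch.nn as nn

class TinyGenerator(nn.Module):
    """Hypothetical stand-in for the assignment's Generator, for illustration only."""
    def __init__(self, z_dim=8, out_dim=12):
        super().__init__()
        self.z_dim = z_dim
        self.net = nn.Sequential(nn.Linear(z_dim, out_dim), nn.Tanh())

    def forward(self, z):
        return self.net(z)

    def sample(self, n, with_grad=False):
        device = next(self.parameters()).device
        z = torch.randn(n, self.z_dim, device=device)
        # Track gradients only when the samples will be used to update the generator.
        with torch.set_grad_enabled(with_grad):
            return self(z)

gen_toy = TinyGenerator()
print(gen_toy.sample(3, with_grad=False).grad_fn)              # None
print(gen_toy.sample(3, with_grad=True).grad_fn is not None)   # True
```

Samples drawn without gradients are cheaper and are enough for the discriminator update and for plotting; samples with gradients are needed when the loss must backpropagate into the generator.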
TODO: Implement the sample() method in the Generator class within the hw4/gan.py module.
samples = gen.sample(5, with_grad=False)
test.assertSequenceEqual(samples.shape, (5, *x0.shape[1:]))
test.assertIsNone(samples.grad_fn)
_ = plot.tensors_as_images(samples.cpu())
samples = gen.sample(5, with_grad=True)
test.assertSequenceEqual(samples.shape, (5, *x0.shape[1:]))
test.assertIsNotNone(samples.grad_fn)
Training GANs is a bit different since we need to train two models simultaneously, each with its own separate loss function and optimizer. We'll implement the training logic as a function that handles one batch of data and updates both the discriminator and the generator based on it.
As mentioned above, GANs are considered hard to train. To get some ideas and tips you can see this paper, this list of "GAN hacks" or just do it the hard way :)
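One possible shape for a single training step is sketched below on toy models. Everything here is an assumption for illustration: the name `train_batch_sketch`, its signature, the fixed labels without noise, and the tiny linear networks standing in for the DCGAN models. The assignment's `train_batch` has a different signature and receives its loss functions as arguments.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def train_batch_sketch(dsc, gen, dsc_opt, gen_opt, x_data, z_dim):
    """Hypothetical single GAN step: update the discriminator, then the generator."""
    n = x_data.shape[0]
    ones = torch.ones(n, 1, device=x_data.device)
    zeros = torch.zeros(n, 1, device=x_data.device)

    # 1) Discriminator step: real data vs. generated data (no generator gradients).
    dsc_opt.zero_grad()
    with torch.no_grad():
        x_fake = gen(torch.randn(n, z_dim, device=x_data.device))
    dsc_loss = (F.binary_cross_entropy_with_logits(dsc(x_data), ones)
                + F.binary_cross_entropy_with_logits(dsc(x_fake), zeros))
    dsc_loss.backward()
    dsc_opt.step()

    # 2) Generator step: fresh samples that stay in the generator's graph.
    gen_opt.zero_grad()
    x_fake = gen(torch.randn(n, z_dim, device=x_data.device))
    gen_loss = F.binary_cross_entropy_with_logits(dsc(x_fake), ones)
    gen_loss.backward()
    gen_opt.step()

    return dsc_loss.item(), gen_loss.item()

# Toy models on 2-D data, standing in for the DCGAN networks.
torch.manual_seed(0)
z_dim_toy = 4
gen_toy = nn.Sequential(nn.Linear(z_dim_toy, 2))
dsc_toy = nn.Sequential(nn.Linear(2, 8), nn.LeakyReLU(0.2), nn.Linear(8, 1))
gen_opt = torch.optim.SGD(gen_toy.parameters(), lr=0.01)
dsc_opt = torch.optim.SGD(dsc_toy.parameters(), lr=0.01)
losses = train_batch_sketch(dsc_toy, gen_toy, dsc_opt, gen_opt, torch.randn(16, 2), z_dim_toy)
print(losses)
```

Each model has its own optimizer, and only that optimizer's `step()` is called in the corresponding half, so the generator update never changes the discriminator's weights even though gradients flow through it.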
TODO:
train_batch function in the hw4/gan.py module.
part3_gan_hyperparams() function within the hw4/answers.py module.
import torch.optim as optim
from torch.utils.data import DataLoader
from hw4.answers import part3_gan_hyperparams
torch.manual_seed(42)
# Hyperparams
hp = part3_gan_hyperparams()
batch_size = hp['batch_size']
z_dim = hp['z_dim']
# Data
dl_train = DataLoader(ds_gwb, batch_size, shuffle=True)
im_size = ds_gwb[0][0].shape
# Model
dsc = gan.Discriminator(im_size).to(device)
gen = gan.Generator(z_dim, featuremap_size=4).to(device)
# Optimizer
def create_optimizer(model_params, opt_params):
    opt_params = opt_params.copy()
    optimizer_type = opt_params['type']
    opt_params.pop('type')
    return optim.__dict__[optimizer_type](model_params, **opt_params)
dsc_optimizer = create_optimizer(dsc.parameters(), hp['discriminator_optimizer'])
gen_optimizer = create_optimizer(gen.parameters(), hp['generator_optimizer'])
# Loss
def dsc_loss_fn(y_data, y_generated):
    return gan.discriminator_loss_fn(y_data, y_generated, hp['data_label'], hp['label_noise'])
def gen_loss_fn(y_generated):
    return gan.generator_loss_fn(y_generated, hp['data_label'])
# Training
checkpoint_file = 'checkpoints/gan'
checkpoint_file_final = f'{checkpoint_file}_final'
if os.path.isfile(f'{checkpoint_file}.pt'):
    os.remove(f'{checkpoint_file}.pt')
# Show hypers
print(hp)
{'batch_size': 4, 'z_dim': 512, 'data_label': 0, 'label_noise': 0.4, 'discriminator_optimizer': {'type': 'SGD', 'lr': 0.01}, 'generator_optimizer': {'type': 'SGD', 'lr': 0.01}}
TODO:
save_checkpoint function in the hw4.gan module. You can decide on your own criterion regarding whether to save a checkpoint at the end of each epoch.
_final. When you run the main.py script to generate your submission, the final checkpoints file will be loaded instead of running training. Note that your final submission zip will not include the checkpoints/ folder. This is OK.
import IPython.display
import tqdm
from hw4.gan import train_batch, save_checkpoint
num_epochs = 100
if os.path.isfile(f'{checkpoint_file_final}.pt'):
    print(f'*** Loading final checkpoint file {checkpoint_file_final} instead of training')
    num_epochs = 0
    gen = torch.load(f'{checkpoint_file_final}.pt', map_location=device)
    checkpoint_file = checkpoint_file_final
try:
    dsc_avg_losses, gen_avg_losses = [], []
    for epoch_idx in range(num_epochs):
        # We'll accumulate batch losses and show an average once per epoch.
        dsc_losses, gen_losses = [], []
        print(f'--- EPOCH {epoch_idx+1}/{num_epochs} ---')
        with tqdm.tqdm(total=len(dl_train.batch_sampler), file=sys.stdout) as pbar:
            for batch_idx, (x_data, _) in enumerate(dl_train):
                x_data = x_data.to(device)
                dsc_loss, gen_loss = train_batch(
                    dsc, gen,
                    dsc_loss_fn, gen_loss_fn,
                    dsc_optimizer, gen_optimizer,
                    x_data)
                dsc_losses.append(dsc_loss)
                gen_losses.append(gen_loss)
                pbar.update()
        dsc_avg_losses.append(np.mean(dsc_losses))
        gen_avg_losses.append(np.mean(gen_losses))
        print(f'Discriminator loss: {dsc_avg_losses[-1]}')
        print(f'Generator loss: {gen_avg_losses[-1]}')
        if save_checkpoint(gen, dsc_avg_losses, gen_avg_losses, checkpoint_file):
            print(f'Saved checkpoint.')
        samples = gen.sample(5, with_grad=False)
        fig, _ = plot.tensors_as_images(samples.cpu(), figsize=(6,2))
        IPython.display.display(fig)
        plt.close(fig)
except KeyboardInterrupt as e:
    print('\n *** Training interrupted by user')
--- EPOCH 1/100 --- 100%|█████████████████████████████████████████| 133/133 [00:28<00:00, 4.67it/s] Discriminator loss: 1.3612773373610991 Generator loss: 0.7768391115324838
--- EPOCH 2/100 --- 100%|█████████████████████████████████████████| 133/133 [00:29<00:00, 4.57it/s] Discriminator loss: 1.353277926158188 Generator loss: 0.7538853556589973
--- EPOCH 3/100 --- 100%|█████████████████████████████████████████| 133/133 [00:29<00:00, 4.53it/s] Discriminator loss: 1.3340200499484414 Generator loss: 0.7980090760646906
--- EPOCH 4/100 --- 100%|█████████████████████████████████████████| 133/133 [00:35<00:00, 3.78it/s] Discriminator loss: 1.3310418805681674 Generator loss: 0.8058272622581711
--- EPOCH 5/100 --- 100%|█████████████████████████████████████████| 133/133 [00:30<00:00, 4.40it/s] Discriminator loss: 1.3612716928460544 Generator loss: 0.7818780236674431
--- EPOCH 6/100 --- 100%|█████████████████████████████████████████| 133/133 [00:28<00:00, 4.62it/s] Discriminator loss: 1.3134378329255527 Generator loss: 0.7463640691642475
--- EPOCH 7/100 --- 100%|█████████████████████████████████████████| 133/133 [00:28<00:00, 4.69it/s] Discriminator loss: 1.3780381191045719 Generator loss: 0.7370426462108928
--- EPOCH 8/100 --- 100%|█████████████████████████████████████████| 133/133 [00:34<00:00, 3.85it/s] Discriminator loss: 1.3114121394946163 Generator loss: 0.8153696891508604
--- EPOCH 9/100 --- 100%|█████████████████████████████████████████| 133/133 [00:33<00:00, 3.92it/s] Discriminator loss: 1.3627268837806874 Generator loss: 0.776933315105008
--- EPOCH 10/100 --- 100%|█████████████████████████████████████████| 133/133 [00:38<00:00, 3.46it/s] Discriminator loss: 1.3678826972057945 Generator loss: 0.7519586512020656
--- EPOCH 11/100 --- 100%|█████████████████████████████████████████| 133/133 [00:35<00:00, 3.80it/s] Discriminator loss: 1.3360447986681658 Generator loss: 0.7902616858482361
--- EPOCH 12/100 --- 100%|█████████████████████████████████████████| 133/133 [00:31<00:00, 4.21it/s] Discriminator loss: 1.3116696490380997 Generator loss: 0.7913338891545633
--- EPOCH 13/100 --- 100%|█████████████████████████████████████████| 133/133 [00:32<00:00, 4.06it/s] Discriminator loss: 1.2946408184847438 Generator loss: 0.8173314055105797
--- EPOCH 14/100 --- 100%|█████████████████████████████████████████| 133/133 [00:28<00:00, 4.64it/s] Discriminator loss: 1.330717668945628 Generator loss: 0.8217142690393261
--- EPOCH 15/100 --- 100%|█████████████████████████████████████████| 133/133 [00:28<00:00, 4.70it/s] Discriminator loss: 1.2794516014873534 Generator loss: 0.8819283063250377
--- EPOCH 16/100 --- 100%|█████████████████████████████████████████| 133/133 [00:31<00:00, 4.26it/s] Discriminator loss: 1.3264111199773343 Generator loss: 0.8114233695922938
--- EPOCH 17/100 --- 100%|█████████████████████████████████████████| 133/133 [00:30<00:00, 4.42it/s] Discriminator loss: 1.3732305074992932 Generator loss: 0.7831786958346689
--- EPOCH 18/100 --- 100%|█████████████████████████████████████████| 133/133 [00:32<00:00, 4.07it/s] Discriminator loss: 1.3156394420709825 Generator loss: 0.8114110406180074
--- EPOCH 19/100 --- 100%|█████████████████████████████████████████| 133/133 [00:32<00:00, 4.07it/s] Discriminator loss: 1.3141640612953587 Generator loss: 0.8553277307883241
--- EPOCH 20/100 --- 100%|█████████████████████████████████████████| 133/133 [00:29<00:00, 4.44it/s] Discriminator loss: 1.3257308136251635 Generator loss: 0.8442430688922566
--- EPOCH 21/100 --- 100%|█████████████████████████████████████████| 133/133 [00:28<00:00, 4.71it/s] Discriminator loss: 1.3306942053307267 Generator loss: 0.7536655525515851
--- EPOCH 22/100 --- 100%|█████████████████████████████████████████| 133/133 [00:28<00:00, 4.63it/s] Discriminator loss: 1.2691815999665654 Generator loss: 0.928295197791623
--- EPOCH 23/100 --- 100%|█████████████████████████████████████████| 133/133 [00:27<00:00, 4.76it/s] Discriminator loss: 1.3326885328256994 Generator loss: 0.7756251746550539
--- EPOCH 24/100 --- 100%|█████████████████████████████████████████| 133/133 [00:28<00:00, 4.65it/s] Discriminator loss: 1.358824984919756 Generator loss: 0.7776076338793102
--- EPOCH 25/100 --- 100%|█████████████████████████████████████████| 133/133 [23:59<00:00, 10.82s/it] Discriminator loss: 1.3053617580492693 Generator loss: 0.7909960948434988
--- EPOCH 26/100 --- 100%|█████████████████████████████████████████| 133/133 [00:36<00:00, 3.66it/s] Discriminator loss: 1.348939643766647 Generator loss: 0.7909385526090636
--- EPOCH 27/100 --- 100%|█████████████████████████████████████████| 133/133 [00:39<00:00, 3.39it/s] Discriminator loss: 1.3605027579723443 Generator loss: 0.7755582498428517
--- EPOCH 28/100 --- 100%|█████████████████████████████████████████| 133/133 [00:37<00:00, 3.56it/s] Discriminator loss: 1.3628243584381907 Generator loss: 0.7874012177151845
--- EPOCH 29/100 --- 100%|█████████████████████████████████████████| 133/133 [00:30<00:00, 4.35it/s] Discriminator loss: 1.3484721968048496 Generator loss: 0.7750240040004701
--- EPOCH 30/100 --- 100%|█████████████████████████████████████████| 133/133 [00:31<00:00, 4.18it/s] Discriminator loss: 1.3944145772690164 Generator loss: 0.7293257641613036
--- EPOCH 31/100 --- 100%|█████████████████████████████████████████| 133/133 [00:32<00:00, 4.10it/s] Discriminator loss: 1.3444358065612334 Generator loss: 0.7094572049782688
--- EPOCH 32/100 --- 100%|█████████████████████████████████████████| 133/133 [00:33<00:00, 4.00it/s] Discriminator loss: 1.3868964551983023 Generator loss: 0.7158935560767812
--- EPOCH 33/100 --- 100%|█████████████████████████████████████████| 133/133 [00:31<00:00, 4.20it/s] Discriminator loss: 1.371898485305614 Generator loss: 0.7572996813551824
--- EPOCH 34/100 --- 100%|█████████████████████████████████████████| 133/133 [00:30<00:00, 4.30it/s] Discriminator loss: 1.3468890181161408 Generator loss: 0.7621907963788599
--- EPOCH 35/100 --- 100%|█████████████████████████████████████████| 133/133 [00:31<00:00, 4.19it/s] Discriminator loss: 1.3391794244149573 Generator loss: 0.7658723104268985
--- EPOCH 36/100 --- 100%|█████████████████████████████████████████| 133/133 [00:30<00:00, 4.42it/s] Discriminator loss: 1.3757759722551905 Generator loss: 0.7750920158131678
--- EPOCH 37/100 --- 100%|█████████████████████████████████████████| 133/133 [00:31<00:00, 4.21it/s] Discriminator loss: 1.353216873075729 Generator loss: 0.7472609581803917
--- EPOCH 38/100 --- 100%|█████████████████████████████████████████| 133/133 [00:30<00:00, 4.35it/s] Discriminator loss: 1.386530635948468 Generator loss: 0.6613477049465466
--- EPOCH 39/100 --- 100%|█████████████████████████████████████████| 133/133 [00:31<00:00, 4.24it/s] Discriminator loss: 1.379711373408038 Generator loss: 0.6873996737308072
--- EPOCH 40/100 --- 100%|█████████████████████████████████████████| 133/133 [00:30<00:00, 4.31it/s] Discriminator loss: 1.3612149953842163 Generator loss: 0.7055589677695941
--- EPOCH 41/100 --- 100%|█████████████████████████████████████████| 133/133 [00:31<00:00, 4.19it/s] Discriminator loss: 1.3983838208635946 Generator loss: 0.6888383317710762
--- EPOCH 42/100 --- 100%|█████████████████████████████████████████| 133/133 [00:29<00:00, 4.53it/s] Discriminator loss: 1.4074606949225403 Generator loss: 0.6762563035004121
--- EPOCH 43/100 --- 100%|█████████████████████████████████████████| 133/133 [00:28<00:00, 4.67it/s] Discriminator loss: 1.3895688307912726 Generator loss: 0.7187479797162508
--- EPOCH 44/100 --- 100%|█████████████████████████████████████████| 133/133 [00:27<00:00, 4.92it/s] Discriminator loss: 1.387302753620578 Generator loss: 0.697377728340321
--- EPOCH 45/100 --- 100%|█████████████████████████████████████████| 133/133 [00:28<00:00, 4.71it/s] Discriminator loss: 1.3809221131461007 Generator loss: 0.7106191462143919
--- EPOCH 46/100 --- 100%|█████████████████████████████████████████| 133/133 [00:27<00:00, 4.90it/s] Discriminator loss: 1.399391840275069 Generator loss: 0.6971177705248496
--- EPOCH 47/100 --- 100%|█████████████████████████████████████████| 133/133 [00:31<00:00, 4.20it/s] Discriminator loss: 1.3900526726156248 Generator loss: 0.7189247890522605
--- EPOCH 48/100 --- 100%|█████████████████████████████████████████| 133/133 [00:27<00:00, 4.84it/s] Discriminator loss: 1.3859154084571321 Generator loss: 0.7078896944684193
--- EPOCH 49/100 --- 100%|█████████████████████████████████████████| 133/133 [00:29<00:00, 4.50it/s] Discriminator loss: 1.3643943394037117 Generator loss: 0.7274110178302106
--- EPOCH 50/100 --- 100%|█████████████████████████████████████████| 133/133 [00:27<00:00, 4.82it/s] Discriminator loss: 1.3860085512462414 Generator loss: 0.6928415022846451
--- EPOCH 51/100 --- 100%|█████████████████████████████████████████| 133/133 [00:30<00:00, 4.42it/s] Discriminator loss: 1.377233802824092 Generator loss: 0.6874544288879051
--- EPOCH 52/100 --- 100%|█████████████████████████████████████████| 133/133 [00:31<00:00, 4.29it/s] Discriminator loss: 1.3724496741043894 Generator loss: 0.7047534024805054
--- EPOCH 53/100 --- 100%|█████████████████████████████████████████| 133/133 [00:30<00:00, 4.41it/s] Discriminator loss: 1.3929707932292967 Generator loss: 0.7077379930288272
--- EPOCH 54/100 --- 100%|█████████████████████████████████████████| 133/133 [00:30<00:00, 4.39it/s] Discriminator loss: 1.3775214942774379 Generator loss: 0.7026135545027884
--- EPOCH 55/100 --- 100%|█████████████████████████████████████████| 133/133 [00:32<00:00, 4.10it/s] Discriminator loss: 1.3894198971583431 Generator loss: 0.7005884270918997
--- EPOCH 56/100 --- 100%|█████████████████████████████████████████| 133/133 [00:29<00:00, 4.44it/s] Discriminator loss: 1.3808675179804177 Generator loss: 0.7141908115910408
--- EPOCH 57/100 --- 100%|█████████████████████████████████████████| 133/133 [00:30<00:00, 4.38it/s] Discriminator loss: 1.3837377415563827 Generator loss: 0.7051679801223869
--- EPOCH 58/100 --- 100%|█████████████████████████████████████████| 133/133 [00:28<00:00, 4.64it/s] Discriminator loss: 1.3683975701941584 Generator loss: 0.7369146907239928
--- EPOCH 59/100 --- 100%|█████████████████████████████████████████| 133/133 [00:30<00:00, 4.40it/s] Discriminator loss: 1.3829459660035326 Generator loss: 0.6538807772155991
--- EPOCH 60/100 --- 100%|█████████████████████████████████████████| 133/133 [00:33<00:00, 3.95it/s] Discriminator loss: 1.3677203225013905 Generator loss: 0.7293664266292313
--- EPOCH 61/100 --- 100%|█████████████████████████████████████████| 133/133 [00:31<00:00, 4.18it/s] Discriminator loss: 1.3800513860874606 Generator loss: 0.7348918784829906
--- EPOCH 62/100 --- 100%|█████████████████████████████████████████| 133/133 [00:28<00:00, 4.73it/s] Discriminator loss: 1.3744387053009262 Generator loss: 0.7359270601344288
--- EPOCH 63/100 --- 100%|█████████████████████████████████████████| 133/133 [00:33<00:00, 4.01it/s] Discriminator loss: 1.373998192916239 Generator loss: 0.7276788197065654
--- EPOCH 64/100 --- 100%|█████████████████████████████████████████| 133/133 [00:32<00:00, 4.14it/s] Discriminator loss: 1.3920388490633857 Generator loss: 0.7200995710559357
--- EPOCH 65/100 --- 100%|█████████████████████████████████████████| 133/133 [00:30<00:00, 4.36it/s] Discriminator loss: 1.369957351146784 Generator loss: 0.7154990430165055
--- EPOCH 66/100 --- 100%|█████████████████████████████████████████| 133/133 [00:32<00:00, 4.09it/s] Discriminator loss: 1.3607412388450222 Generator loss: 0.7023641113051795
--- EPOCH 67/100 --- 100%|█████████████████████████████████████████| 133/133 [00:33<00:00, 4.02it/s] Discriminator loss: 1.3709515377991182 Generator loss: 0.7412473272560234
--- EPOCH 68/100 --- 100%|█████████████████████████████████████████| 133/133 [00:31<00:00, 4.28it/s] Discriminator loss: 1.3542222170005167 Generator loss: 0.7531414652677407
--- EPOCH 69/100 --- 100%|█████████████████████████████████████████| 133/133 [00:33<00:00, 3.98it/s] Discriminator loss: 1.3692521493237717 Generator loss: 0.7447243252194914
--- EPOCH 70/100 --- 100%|█████████████████████████████████████████| 133/133 [00:30<00:00, 4.29it/s] Discriminator loss: 1.3937095358855742 Generator loss: 0.7002077281923222
--- EPOCH 71/100 --- 100%|█████████████████████████████████████████| 133/133 [00:31<00:00, 4.16it/s] Discriminator loss: 1.392579540274197 Generator loss: 0.6932471712729088
--- EPOCH 72/100 --- 100%|█████████████████████████████████████████| 133/133 [00:30<00:00, 4.38it/s] Discriminator loss: 1.3606340670047845 Generator loss: 0.6910830140113831
--- EPOCH 73/100 --- 100%|█████████████████████████████████████████| 133/133 [00:30<00:00, 4.32it/s] Discriminator loss: 1.380547304798786 Generator loss: 0.7060410550662449
--- EPOCH 74/100 --- 100%|█████████████████████████████████████████| 133/133 [00:30<00:00, 4.41it/s] Discriminator loss: 1.3912733766369354 Generator loss: 0.6862335025816035
--- EPOCH 75/100 --- 100%|█████████████████████████████████████████| 133/133 [00:32<00:00, 4.09it/s] Discriminator loss: 1.3533304427799426 Generator loss: 0.7621926060296539
--- EPOCH 76/100 --- 100%|█████████████████████████████████████████| 133/133 [00:29<00:00, 4.46it/s] Discriminator loss: 1.3760721602834256 Generator loss: 0.71099306004388
--- EPOCH 77/100 --- 100%|█████████████████████████████████████████| 133/133 [00:31<00:00, 4.25it/s] Discriminator loss: 1.375013388189158 Generator loss: 0.7724691799708775
--- EPOCH 78/100 --- 100%|█████████████████████████████████████████| 133/133 [00:31<00:00, 4.25it/s] Discriminator loss: 1.3783018929617745 Generator loss: 0.7241247108108119
--- EPOCH 79/100 --- 100%|█████████████████████████████████████████| 133/133 [00:33<00:00, 3.93it/s] Discriminator loss: 1.3538859948179776 Generator loss: 0.7574535751701298
--- EPOCH 80/100 --- 100%|█████████████████████████████████████████| 133/133 [00:33<00:00, 3.97it/s] Discriminator loss: 1.3742202932673289 Generator loss: 0.7243723779692686
--- EPOCH 81/100 --- 100%|█████████████████████████████████████████| 133/133 [00:42<00:00, 3.10it/s] Discriminator loss: 1.363932890999586 Generator loss: 0.7276730062370014
--- EPOCH 82/100 --- 100%|█████████████████████████████████████████| 133/133 [00:37<00:00, 3.53it/s] Discriminator loss: 1.3647771491143936 Generator loss: 0.7565775791505226
--- EPOCH 83/100 --- 100%|█████████████████████████████████████████| 133/133 [00:32<00:00, 4.09it/s] Discriminator loss: 1.38117810299522 Generator loss: 0.7473030493671733
--- EPOCH 84/100 --- 100%|█████████████████████████████████████████| 133/133 [00:32<00:00, 4.14it/s] Discriminator loss: 1.3946288716524167 Generator loss: 0.6966705824199476
--- EPOCH 85/100 --- 100%|█████████████████████████████████████████| 133/133 [00:31<00:00, 4.23it/s] Discriminator loss: 1.3750159507407282 Generator loss: 0.7198978205372516
--- EPOCH 86/100 --- 100%|█████████████████████████████████████████| 133/133 [00:31<00:00, 4.27it/s] Discriminator loss: 1.3447362444454567 Generator loss: 0.74971271054189
--- EPOCH 87/100 --- 100%|█████████████████████████████████████████| 133/133 [00:29<00:00, 4.48it/s] Discriminator loss: 1.3901844266662025 Generator loss: 0.7170487498878536
--- EPOCH 88/100 --- 100%|█████████████████████████████████████████| 133/133 [00:30<00:00, 4.30it/s] Discriminator loss: 1.356699283857991 Generator loss: 0.7160336357310302
--- EPOCH 89/100 --- 100%|█████████████████████████████████████████| 133/133 [00:31<00:00, 4.26it/s] Discriminator loss: 1.3760145544109488 Generator loss: 0.7182338896550631
--- EPOCH 90/100 --- 100%|█████████████████████████████████████████| 133/133 [00:33<00:00, 4.03it/s] Discriminator loss: 1.3508277844665642 Generator loss: 0.7397988003895695
--- EPOCH 91/100 --- 100%|█████████████████████████████████████████| 133/133 [00:29<00:00, 4.49it/s] Discriminator loss: 1.3720417040631288 Generator loss: 0.7488065503145519
--- EPOCH 92/100 --- 100%|█████████████████████████████████████████| 133/133 [00:29<00:00, 4.44it/s] Discriminator loss: 1.342560267089901 Generator loss: 0.7304007343779829
--- EPOCH 93/100 --- 100%|█████████████████████████████████████████| 133/133 [00:29<00:00, 4.55it/s] Discriminator loss: 1.3711320430712592 Generator loss: 0.7673848535781517
--- EPOCH 94/100 --- 100%|█████████████████████████████████████████| 133/133 [00:28<00:00, 4.72it/s] Discriminator loss: 1.3838464720804888 Generator loss: 0.6711053323924989
--- EPOCH 95/100 --- 100%|█████████████████████████████████████████| 133/133 [00:31<00:00, 4.27it/s] Discriminator loss: 1.4033516260018026 Generator loss: 0.6842943461317765
--- EPOCH 96/100 --- 100%|█████████████████████████████████████████| 133/133 [00:30<00:00, 4.42it/s] Discriminator loss: 1.380405118590907 Generator loss: 0.72748539411932
--- EPOCH 97/100 --- 100%|█████████████████████████████████████████| 133/133 [00:30<00:00, 4.29it/s] Discriminator loss: 1.377022463576238 Generator loss: 0.7070425565081432
--- EPOCH 98/100 --- 100%|█████████████████████████████████████████| 133/133 [00:30<00:00, 4.30it/s] Discriminator loss: 1.3909148614209397 Generator loss: 0.7345977582429585
--- EPOCH 99/100 --- 100%|█████████████████████████████████████████| 133/133 [00:29<00:00, 4.48it/s] Discriminator loss: 1.3459629752582176 Generator loss: 0.7756604776346594
--- EPOCH 100/100 --- 100%|█████████████████████████████████████████| 133/133 [00:29<00:00, 4.48it/s] Discriminator loss: 1.3605147465727383 Generator loss: 0.7258370196012626
# Plot images from best or last model
if os.path.isfile(f'{checkpoint_file}.pt'):
    gen = torch.load(f'{checkpoint_file}.pt', map_location=device)
print('*** Images Generated from best model:')
samples = gen.sample(n=15, with_grad=False).cpu()
fig, _ = plot.tensors_as_images(samples, nrows=3, figsize=(6,6))
*** Images Generated from best model:
TODO Answer the following questions. Write your answers in the appropriate variables in the module hw4/answers.py.
from cs236781.answers import display_answer
import hw4.answers as answers
Explain in detail why during training we sometimes need to maintain gradients when sampling from the GAN, and other times we don't. When are they maintained and why? When are they discarded and why?
display_answer(answers.part3_q1)
Your answer:
Write your answer using markdown and $\LaTeX$:
# A code block
a = 2
An equation: $e^{i\pi} + 1 = 0$
When training a GAN to generate images, should we decide to stop training solely based on the fact that the Generator loss is below some threshold? Why or why not?
What does it mean if the discriminator loss remains at a constant value while the generator loss decreases?
display_answer(answers.part3_q2)
Your answer:
Write your answer using markdown and $\LaTeX$:
# A code block
a = 2
An equation: $e^{i\pi} + 1 = 0$
Compare the results you got when generating images with the VAE to the GAN results. What's the main difference and what's causing it?
display_answer(answers.part3_q3)
Your answer:
Write your answer using markdown and $\LaTeX$:
# A code block
a = 2
An equation: $e^{i\pi} + 1 = 0$
This section contains summary questions about various topics from the course material.
You can add your answers in new cells below the questions.
Notes
Answer:
The receptive field of a feature is the region of the input space that affects the value of that feature in a given layer of the CNN.
Answer:
Increasing the number of convolutional layers. Each extra layer with kernel size $k$ enlarges the receptive field by $(k-1)$ times the product of the strides of the layers before it.
Adding pooling layers, which also reduce the dimensions of the feature maps and therefore the amount of computation performed in the rest of the network. A pooling layer summarises the features present in a region of the feature map generated by a convolution layer. Because it downsamples, every later layer covers a multiplicatively larger region of the input.
Dilated convolutions. They introduce spacing between the values of a convolutional kernel, so the number of weights in the kernel is unchanged while its effective size grows. Stacking layers with exponentially increasing dilation makes the receptive field grow exponentially with depth.
import torch
import torch.nn as nn
cnn = nn.Sequential(
    nn.Conv2d(in_channels=3, out_channels=4, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.MaxPool2d(2),
    nn.Conv2d(in_channels=4, out_channels=16, kernel_size=5, stride=2, padding=2),
    nn.ReLU(),
    nn.MaxPool2d(2),
    nn.Conv2d(in_channels=16, out_channels=32, kernel_size=7, dilation=2, padding=3),
    nn.ReLU(),
)
cnn(torch.rand(size=(1, 3, 1024, 1024), dtype=torch.float32)).shape
torch.Size([1, 32, 122, 122])
What is the size (spatial extent) of the receptive field of each "pixel" in the output tensor?
Answer:
Using the formulas:
$$ r_{out} = r_{in} + (k - 1) \cdot \prod_{i=1}^{l-1} s_i $$
where $k$ is the kernel size of layer $l$ and $s_i$ are the strides of the layers before it. For kernels with dilation $d$, the effective kernel size is:
$$ k_{eff} = d \cdot (k - 1) + 1 $$
We will get:
Layer 1 (Conv2D): $$ k = 3, s = 1 $$ $$ R_1 = 1 + (3 - 1) \cdot 1 = 3 $$
Layer 2 (Pooling): $$ k = 2, s = 2 $$ $$ R_2 = R_1 + (2 - 1) \cdot 1 = 4 $$
Layer 3 (Conv2D): $$ k = 5, s = 2 $$ $$ R_3 = R_2 + (5 - 1) \cdot 2 = 12 $$
Layer 4 (Pooling): $$ k = 2, s = 2 $$ $$ R_4 = R_3 + (2 - 1) \cdot 2^2 = 16 $$
Layer 5 (Conv2D): $$ k_{eff} = 2 \cdot (7 - 1) + 1 = 13, s = 1 $$ $$ R_5 = R_4 + (13 - 1) \cdot 2^3 = 112 $$
The size of the receptive field of each "pixel" in the output tensor is [112 x 112]
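The layer-by-layer recursion can be checked with a few lines of plain Python. This is a minimal sketch: the helper name `receptive_field` and the `(kernel_size, stride, dilation)` tuple layout are assumptions made for illustration, and padding is ignored because it does not change the size of the receptive field.

```python
def receptive_field(layers):
    """Return the receptive field (in input pixels) of one output feature.

    `layers` is a list of (kernel_size, stride, dilation) tuples ordered
    from the input to the output of the network.
    """
    r = 1      # receptive field of a single input pixel
    jump = 1   # product of the strides of all layers seen so far
    for kernel_size, stride, dilation in layers:
        k_eff = dilation * (kernel_size - 1) + 1  # effective kernel size
        r += (k_eff - 1) * jump   # uses the strides of previous layers only
        jump *= stride            # this layer's stride affects later layers
    return r


# Conv and pooling layers of `cnn`; the ReLU layers are element-wise and
# leave the receptive field unchanged, so they are omitted.
cnn_layers = [
    (3, 1, 1),  # Conv2d(kernel_size=3)
    (2, 2, 1),  # MaxPool2d(2)
    (5, 2, 1),  # Conv2d(kernel_size=5, stride=2)
    (2, 2, 1),  # MaxPool2d(2)
    (7, 1, 2),  # Conv2d(kernel_size=7, dilation=2)
]
print(receptive_field(cnn_layers))
```

Note that the stride of a layer only enters the update of the layers that come after it, which is why `jump` is multiplied after `r` is updated.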
You have trained a CNN, where each layer $l$ is represented by the mapping $\vec{y}_l=f_l(\vec{x};\vec{\theta}_l)$, and $f_l(\cdot;\vec{\theta}_l)$ is a convolutional layer (not including the activation function).
After hearing that residual networks can be made much deeper, you decide to change each layer in your network to use the following residual mapping instead, $\vec{y}_l=f_l(\vec{x};\vec{\theta}_l)+\vec{x}$, and re-train.
However, to your surprise, by visualizing the learned filters $\vec{\theta}_l$ you observe that the original network and the residual network produce completely different filters. Explain the reason for this.
Answer:
In residual networks we use skip connections to add the input of a block to its output (skipping one or more layers); this allows creating deeper networks and mitigates the problem of vanishing gradients. With the residual mapping, $f_l$ only has to learn the difference $\vec{y}_l-\vec{x}$ rather than the full mapping, so the optimization problem is different and it results in different filters.
Answer:
False. It doesn't matter where it is placed, since it sets a fraction of the units to zero.
Answer:
When we apply dropout, only a fraction of the neurons is active during training, while at test time all of them are active. That's why we need to rescale the activations to compensate.
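The scaling argument can be illustrated with a small pure-Python sketch. It assumes the "inverted dropout" convention, in which the surviving units are scaled by 1 / (1 - p) during training so that nothing has to change at test time; the function name `dropout` and its arguments are hypothetical and chosen only for this example.

```python
import random


def dropout(activations, p, training, rng):
    """Inverted dropout on a list of floats.

    During training each unit is zeroed with probability `p` and the
    survivors are scaled by 1 / (1 - p). At test time the input is
    returned unchanged.
    """
    if not training:
        return list(activations)
    keep_prob = 1.0 - p
    return [a / keep_prob if rng.random() < keep_prob else 0.0
            for a in activations]


rng = random.Random(0)
x = [1.0] * 100_000
train_out = dropout(x, p=0.5, training=True, rng=rng)
test_out = dropout(x, p=0.5, training=False, rng=rng)
train_mean = sum(train_out) / len(train_out)
test_mean = sum(test_out) / len(test_out)
print(f"train mean: {train_mean:.3f}, test mean: {test_mean:.3f}")
```

With the scaling in place, the average activation seen by the next layer is approximately the same in both modes. The alternative convention is to leave training activations unscaled and multiply by (1 - p) at test time.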
Answer:
The default loss function for a binary classification task is binary cross-entropy; minimizing it is equivalent to maximizing the likelihood of the labels under the predicted probabilities. L2 loss measures the squared error between the prediction and the label, and it's the default loss function for regression tasks.
$L_2 \text{ loss} = \sum_{i=1}^{N} (y_i-y_i^{pred})^2$
$\text{BCE loss} = \frac{1}{N} \sum_{i=1}^{N} -(y_i\cdot \log(p_i) + (1-y_i)\cdot \log(1-p_i))$
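To make the difference between the two losses concrete, the sketch below evaluates both on two hand-picked predictions. The function names and the example numbers are assumptions chosen for illustration. It shows that, for probabilities in (0, 1), the squared error of a single sample can never exceed 1, while the cross-entropy of a confident mistake grows without bound.

```python
import math


def l2_loss(y_true, y_pred):
    """Sum of squared errors between labels and predictions."""
    return sum((y - p) ** 2 for y, p in zip(y_true, y_pred))


def bce_loss(y_true, p_pred):
    """Mean binary cross-entropy; p_pred must lie strictly in (0, 1)."""
    terms = [-(y * math.log(p) + (1 - y) * math.log(1 - p))
             for y, p in zip(y_true, p_pred)]
    return sum(terms) / len(terms)


labels = [1, 0]
mild = [0.6, 0.4]                # correct side, low confidence
confident_wrong = [0.01, 0.99]   # wrong side, high confidence

for name, probs in [("mild", mild), ("confident_wrong", confident_wrong)]:
    print(name, l2_loss(labels, probs), bce_loss(labels, probs))
```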

You decide to train a cutting-edge deep neural network regression model, that will predict the global temperature based on the population of pirates in N locations around the globe.
You define your model as follows:
import torch.nn as nn
N = 42 # number of known global pirate hot spots
H = 128
mlpirate = nn.Sequential(
    nn.Linear(in_features=N, out_features=H),
    nn.Sigmoid(),
    *[
        nn.Linear(in_features=H, out_features=H),
        nn.Sigmoid(),
    ]*N,
    nn.Linear(in_features=H, out_features=1),
)
While training your model you notice that the loss reaches a plateau after only a few iterations. It seems that your model is no longer training. What is the most likely cause?
Answer:
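The model is more than 40 layers deep (one input layer followed by N = 42 hidden blocks), has no skip connections, and uses sigmoid activations throughout, which are better suited to an output layer. The derivative of a sigmoid is at most 0.25, so the gradient shrinks at every layer on the way back, and the model stops training due to vanishing gradients.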
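A rough back-of-the-envelope check of this argument: the sketch below bounds the factor that the sigmoid activations alone contribute to the gradient reaching the first layer. It deliberately ignores the weight matrices, so it is an illustration under that assumption rather than an exact gradient computation; the helper names are introduced only for this example.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def sigmoid_grad(x):
    """Derivative of the sigmoid: s(x) * (1 - s(x))."""
    s = sigmoid(x)
    return s * (1.0 - s)


# One Sigmoid after the first Linear plus N = 42 in the repeated block.
num_sigmoids = 43
max_grad = sigmoid_grad(0.0)       # the derivative peaks at x = 0
bound = max_grad ** num_sigmoids   # best case contribution of the activations
print(f"max derivative: {max_grad}, bound after {num_sigmoids} sigmoids: {bound:.3e}")
```

Even in this best case, where every pre-activation sits exactly at zero, the activations scale the gradient by a factor far below the resolution of 32-bit floats relative to typical weight magnitudes.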
sigmoid activations with tanh, it will solve your problem. Is he correct? Explain why or why not.
Answer:
The range of a sigmoid is (0, 1) and the range of a tanh is (-1, 1). The derivative of tanh peaks at 1 instead of 0.25, so it can help a little, but tanh still saturates for inputs of large magnitude and the model is so deep that this change is unlikely to solve the problem of vanishing gradients.
Answer: True. But we can still get zero-valued nodes.
B. The gradient of ReLU is linear with its input when the input is positive.
Answer: False. The gradient is constant and equal to 1.
C. ReLU can cause "dead" neurons, i.e. activations that remain at a constant value of zero.
Answer: True. For negative inputs both the output and the gradient are zero, so a neuron whose input stays negative stops updating.
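The statements about ReLU and its gradient can be verified directly. This is a minimal sketch with hypothetical helper names; the derivative at exactly zero is set to 0 by convention.

```python
def relu(x):
    return max(0.0, x)


def relu_grad(x):
    # Derivative of ReLU; the value at exactly 0 is a convention (0 here).
    return 1.0 if x > 0 else 0.0


for x in (-2.0, -0.5, 0.5, 2.0, 100.0):
    print(f"x={x:6.1f}  relu={relu(x):6.1f}  grad={relu_grad(x)}")
```

For positive inputs the gradient does not depend on the input value, and for negative inputs no gradient flows back through the unit at all.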
Answer: In GD the whole dataset is used to calculate the average loss and make a single update to the weights.
In mini-batch SGD the loss calculation and weight update are made on a fraction of the dataset at a time. It takes one epoch to go over the whole training set.
Stochastic gradient descent (SGD) calculates the gradient for a single sample and updates the weights after each one.
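The three variants differ only in how many samples go into each update, which the following sketch makes explicit with a single `batch_size` parameter. The toy one-parameter regression problem, the function names, and the hyperparameters are all assumptions made for illustration.

```python
import random


def grad_mse(w, batch):
    """Gradient w.r.t. w of the mean squared error of the model y = w * x."""
    return sum(2.0 * (w * x - y) * x for x, y in batch) / len(batch)


def train(data, batch_size, lr=0.5, epochs=50, seed=0):
    """Run gradient descent with the given batch size; return (w, num_updates)."""
    rng = random.Random(seed)
    data = list(data)
    w, num_updates = 0.0, 0
    for _ in range(epochs):
        rng.shuffle(data)  # new sample order every epoch
        for start in range(0, len(data), batch_size):
            batch = data[start:start + batch_size]
            w -= lr * grad_mse(w, batch)
            num_updates += 1
    return w, num_updates


# Hypothetical noise-free toy data with true slope 3.0
data = [(i / 10, 3.0 * (i / 10)) for i in range(1, 11)]

w_gd, n_gd = train(data, batch_size=len(data))  # GD: 1 update per epoch
w_mb, n_mb = train(data, batch_size=4)          # mini-batch SGD: 3 updates per epoch
w_sgd, n_sgd = train(data, batch_size=1)        # SGD: 10 updates per epoch
print(w_gd, n_gd, w_mb, n_mb, w_sgd, n_sgd)
```

On this noise-free problem all three settings recover the same slope; what changes is the number of weight updates performed per pass over the data.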
Answer:
A. 1 - Memory is limited. 2 - GD can get stuck at a local minimum.
B. When the dataset is too big to fit in memory.
Answer:
It's difficult to know. On the one hand, we expect the number of iterations to decrease, because in each batch we are averaging over more samples; on the other hand, too large a batch size can lead to poor generalization.
Answer:
A. True. For every update we consider one sample.
B. False. Since we calculate the gradient for one sample at a time, the gradients are more affected by noise.
C. True. There is no chance that every sample in SGD will lead to the same local minimum.
D. False. In SGD we need memory to store only one sample.
E. False. We can't guarantee that.
F. False. Even though momentum prevents SGD from oscillating in a narrow ravine, in Newton's method the second derivative improves convergence more effectively.
Answer:
False. In the tutorial we saw that there are cases where the minimum can be found without a descent method, by an analytical solution.
Answer:
A. Vanishing gradients occur when increasingly small gradients are backpropagated through the network, so the early layers barely update. Activation functions with plateaus, such as sigmoid and tanh, may lead to this problem: when many small derivatives are multiplied by the chain rule, the gradient approaches zero. Exploding gradients are caused by very large derivatives, which make the model unstable.
B. Due to the chain rule.
C. If we assume a 3-layer CNN, by the chain rule, for the first layer:
$\frac{d\,f(f(f(x)))}{dx} = \frac{d\,f(f(f(x)))}{d\,f(f(x))} \cdot \frac{d\,f(f(x))}{d\,f(x)} \cdot \frac{d\,f(x)}{dx}$
If each factor has magnitude greater than 1, the product grows with depth and the gradients explode; if each factor is smaller than 1, the product shrinks and the gradients vanish.
D. The loss will reach a plateau if the gradients are vanishing and oscillate if the gradients are exploding.
You wish to train the following 2-layer MLP for a binary classification task: $$ \hat{y}^{(i)} =\mat{W}_2~ \varphi(\mat{W}_1 \vec{x}^{(i)}+ \vec{b}_1) + \vec{b}_2 $$ You wish to minimize the in-sample loss function, defined as $$ L_{\mathcal{S}} = \frac{1}{N}\sum_{i=1}^{N}\ell(y^{(i)},\hat{y}^{(i)}) + \frac{\lambda}{2}\left(\norm{\mat{W}_1}_F^2 + \norm{\mat{W}_2}_F^2 \right) $$ where the pointwise loss is binary cross-entropy: $$ \ell(y, \hat{y}) = - y \log(\hat{y}) - (1-y) \log(1-\hat{y}) $$
Write an analytic expression for the derivative of the final loss $L_{\mathcal{S}}$ w.r.t. each of the following tensors: $\mat{W}_1$, $\mat{W}_2$, $\mat{b}_1$, $\mat{b}_2$, $\mat{x}$.
part4_affine_backward in hw4/answers.py so that it passes the asserts.
from torch.autograd import Function
from hw4.answers import part4_affine_backward
N, d_in, d_out = 100, 11, 7
dtype = torch.float64
X = torch.rand(N, d_in, dtype=dtype)
W = torch.rand(d_out, d_in, requires_grad=True, dtype=dtype)
b = torch.rand(d_out, requires_grad=True, dtype=dtype)
def affine(X, W, b):
    return 0.5 * X @ W.T + b

class AffineLayerFunction(Function):
    @staticmethod
    def forward(ctx, X, W, b):
        result = affine(X, W, b)
        ctx.save_for_backward(X, W, b)
        return result

    @staticmethod
    def backward(ctx, grad_output):
        return part4_affine_backward(ctx, grad_output)
l1 = torch.sum(AffineLayerFunction.apply(X, W, b))
l1.backward()
W_grad1 = W.grad
b_grad1 = b.grad
l2 = torch.sum(affine(X, W, b))
W.grad = b.grad = None
l2.backward()
W_grad2 = W.grad
b_grad2 = b.grad
assert torch.allclose(W_grad1, W_grad2)
assert torch.allclose(b_grad1, b_grad2)
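For reference, the gradients compared by the asserts above can also be derived by hand. The following is a sketch under the assumption that $G = \partial L / \partial Y$ denotes the upstream gradient (the `grad_output` argument) and $\mathbf{1}_N$ a vector of $N$ ones; both symbols are introduced here for illustration and are not part of the assignment code. With $Y = \frac{1}{2} X W^\top + b$, where the bias is broadcast over the $N$ rows:

$$ \frac{\partial L}{\partial X} = \frac{1}{2} G W, \qquad \frac{\partial L}{\partial W} = \frac{1}{2} G^\top X, \qquad \frac{\partial L}{\partial b} = G^\top \mathbf{1}_N $$

In the test above $L$ is the sum of all entries of $Y$, so $G$ is a matrix of ones: every row of $\partial L/\partial W$ is then half the column-wise sum of $X$, and every entry of $\partial L/\partial b$ equals $N$.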
Answer:
A. Word embeddings are a way to represent words as dense vectors, such that words that are close in the embedding space are expected to have similar meanings. They are used in language models so that words can be processed numerically.
B. No, because we need a way to make numerical calculations.
Y contain? why this output shape?
nn.Embedding yourself using only torch tensors.
import torch.nn as nn
X = torch.randint(low=0, high=42, size=(5, 6, 7, 8))
embedding = nn.Embedding(num_embeddings=42, embedding_dim=42000)
Y = embedding(X)
print(f"{Y.shape=}")
Answer:
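A. Y contains the embedding vector of every index in X. Its shape is (5, 6, 7, 8, 42000), i.e. the shape of X with an extra trailing dimension of size embedding_dim = 42000, because each integer index is replaced by a 42000-dimensional vector.
B. We can write: embedding = torch.rand(size=(num_embeddings, embedding_dim)) and then: Y = embedding[X]. Plain integer indexing is needed here because torch.gather requires index to have the same number of dimensions as input.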
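The lookup behind an embedding layer is just "replace every index by a row of a table". The pure-Python sketch below mirrors that on nested lists so that the resulting shape can be inspected without any library; the helper names and the small sizes are assumptions for illustration. With torch tensors the same lookup can be expressed as integer indexing into the weight matrix.

```python
import random


def make_embedding_table(num_embeddings, embedding_dim, seed=0):
    """Random table with one row of `embedding_dim` floats per index."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(embedding_dim)]
            for _ in range(num_embeddings)]


def embed(indices, table):
    """Replace every integer in a (possibly nested) list by its table row."""
    if isinstance(indices, int):
        return table[indices]
    return [embed(i, table) for i in indices]


def shape(nested):
    """Shape of a regular nested list, e.g. [[1, 2, 3], [4, 5, 6]] -> (2, 3)."""
    dims = []
    while isinstance(nested, list):
        dims.append(len(nested))
        nested = nested[0]
    return tuple(dims)


table = make_embedding_table(num_embeddings=42, embedding_dim=4)
X = [[0, 41, 7], [3, 3, 5]]   # index "tensor" of shape (2, 3)
Y = embed(X, table)
print(shape(X), "->", shape(Y))
```

The output keeps every dimension of the index structure and appends one trailing dimension of size `embedding_dim`; equal indices map to the same row.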
Answer:
A. True. Here the losses are accumulated, and then the update is made using the accumulated gradients from all timesteps.
B. False. The input remains the same; only the sequence used for backpropagation changes.
C. False. We can learn relations between inputs that are more than S timesteps apart: the hidden state of the previous sequence is kept and used in the next sequence, so the output can depend on all previous timesteps.
In tutorial 7 (part 2) we learned how to use attention to perform alignment between a source and target sequence in machine translation.
As we have seen, a variational autoencoder's loss is comprised of a reconstruction term and a KL-divergence term. While training your VAE, you accidentally forgot to include the KL-divergence term. What would be the qualitative effect of this on: